EVALUATION
One thread is in _thread_new even though the creating thread has already
called os::start_thread() on it. It looks like the start thread event got lost.
###@###.### 2002-03-12
There is another type of hang in vmark. The VM thread couldn't grab the
Threads_lock in SafepointSynchronize::begin(), although the _owner
of Threads_lock is 0x0. Looking into the pthread_mutex_lock frames,
it appears the underlying _mutex of Threads_lock is indeed locked
by some thread. This will probably need a different bugid.
To reproduce the hang:
> java COM.volano.Main
> repeat 1000 java COM.volano.Mark -count 1
###@###.### 2002-03-14
I am tracking the second type of hang under bug id 4654490.
###@###.### 2002-03-18
Both this hang and 4654490 are caused by a bug in the 2.4 SMP kernel. It
appears that the 2.4 SMP kernel may sometimes hand out duplicate PIDs if two
processes are creating threads at the same time. Indeed, I can reproduce the
duplicate PID problem with a C testcase that uses only "fork".
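
The C testcase itself is not attached to this report. As a rough sketch of
what such a check might look like (written in Python rather than C, and not
the original testcase), the following forks from two processes at the same
time, keeps every child alive, and then looks for a PID that was handed out
twice. The names find_duplicates, spawn_children and run_trial, and the
default counts, are made up for illustration. On a kernel without the race
it should find nothing.

```python
import os
from collections import Counter


def find_duplicates(pids):
    """Return, sorted, every PID that appears more than once."""
    counts = Counter(pids)
    return sorted(pid for pid, n in counts.items() if n > 1)


def spawn_children(count, report_fd, hold_fd):
    """Fork `count` children, report their PIDs, then reap them."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            try:
                # Child: drop the report pipe, then stay alive until the
                # top-level parent closes the hold pipe (read sees EOF).
                os.close(report_fd)
                os.read(hold_fd, 1)
            finally:
                os._exit(0)
        pids.append(pid)
    # One short write per spawner; atomic while it is under PIPE_BUF bytes.
    os.write(report_fd, (" ".join(map(str, pids)) + "\n").encode())
    os.close(report_fd)
    for pid in pids:
        os.waitpid(pid, 0)


def run_trial(spawners=2, children_each=50):
    """Fork concurrently from several processes; return duplicate PIDs."""
    hold_r, hold_w = os.pipe()
    rep_r, rep_w = os.pipe()
    spawner_pids = []
    for _ in range(spawners):
        pid = os.fork()
        if pid == 0:
            try:
                os.close(hold_w)  # only the top-level parent keeps this end
                os.close(rep_r)
                spawn_children(children_each, rep_w, hold_r)
            finally:
                os._exit(0)
        spawner_pids.append(pid)
    os.close(rep_w)
    os.close(hold_r)
    with os.fdopen(rep_r) as reports:
        # EOF arrives once every spawner has reported and closed its end.
        child_pids = [int(tok) for tok in reports.read().split()]
    os.close(hold_w)  # all PIDs are recorded; release the children
    for pid in spawner_pids:
        os.waitpid(pid, 0)
    return find_duplicates(spawner_pids + child_pids)


if __name__ == "__main__":
    print("duplicate PIDs:", run_trial())
```

Because every child blocks until all PIDs have been collected, all of the
recorded processes are alive at the same moment, so any repeated PID would
be a genuine duplicate rather than normal PID reuse.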
Note that each thread on Linux is essentially a process and must have a
unique PID. If two threads are created with the same PID, signals that
are meant to start a newly created thread or to wake up a thread blocked in
pthread_mutex_lock() or pthread_cond_wait() may get delivered to the
wrong thread (LinuxThreads uses "kill(PID, )" to implement pthread_kill()
and to restart a sleeping thread). If that happens, we may end up with a
hanging VM because some of its threads never wake up.
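
To illustrate the failure mode described above, here is a toy model in
Python. It is not how the kernel or LinuxThreads actually look up a signal
target; the class ToySignalTable and the rule that a duplicate PID replaces
the earlier entry are assumptions made purely for illustration. It only
shows why addressing a restart signal by PID can leave one thread asleep
when two threads share that PID.

```python
class ToySignalTable:
    """Toy model only: maps a PID to the one thread a signal would reach."""

    def __init__(self):
        self.thread_for_pid = {}
        self.sleeping = set()

    def create_thread(self, name, pid):
        # Assumption for illustration: a duplicate PID silently replaces
        # the earlier thread's entry.
        self.thread_for_pid[pid] = name
        self.sleeping.add(name)

    def kill(self, pid):
        """Deliver a restart signal addressed by PID; return who got it."""
        target = self.thread_for_pid.get(pid)
        self.sleeping.discard(target)
        return target


table = ToySignalTable()
table.create_thread("thread-A", 26829)
table.create_thread("thread-B", 26829)  # duplicate PID
table.kill(26829)  # restart meant for thread-A
table.kill(26829)  # restart meant for thread-B
print(sorted(table.sleeping))  # thread-A is still asleep
```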
In the Java testcase, when VMark hangs, I can see duplicate PIDs with
this command:
[root@jtg-linux1 /root]# ps -A|sort|uniq -D
26829 ? 00:00:00 java
26829 ? 00:00:00 java
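
The same check can be scripted. The sketch below is a hypothetical helper
(duplicate_pid_lines is not part of any tool mentioned here) that takes the
text printed by "ps -A" and returns the lines that share a PID. Unlike
"uniq -D", it groups on the PID column only, so two entries with the same PID
but a different TIME or CMD would also be reported. The "init" line in the
sample is made up; the two java lines are the ones shown above.

```python
from collections import defaultdict


def duplicate_pid_lines(ps_output):
    """Return the `ps -A` lines whose PID is shared by another line."""
    by_pid = defaultdict(list)
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header and blank lines; the PID is the first column.
        if fields and fields[0].isdigit():
            by_pid[int(fields[0])].append(line.strip())
    return [ln for pid in sorted(by_pid) if len(by_pid[pid]) > 1
            for ln in by_pid[pid]]


sample = """\
  PID TTY          TIME CMD
    1 ?        00:00:04 init
26829 ?        00:00:00 java
26829 ?        00:00:00 java
"""
print("\n".join(duplicate_pid_lines(sample)))
```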
It looks like this kernel race has been fixed in kernel 2.4.18. The changelog
of 2.4.18 contains:
- Fix SMP race on PID allocation (Erik A. Hendriks)
This hang and 4654490 are not reproducible when vmark is run on kernel 2.4.18.
Note that kernel 2.4.18 is included in Red Hat 7.3 beta. If you want to change
to Red Hat 7.3, please also see bug 4654443.
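
A quick way to tell whether a machine is likely to be affected is to compare
its kernel release against 2.4.18. The following is only a sketch: the names
FIXED_IN, parse_release and has_pid_race_fix are hypothetical, and a plain
version comparison is an approximation, since a vendor kernel with a lower
version number could carry the fix as a backport.

```python
import platform
import re

FIXED_IN = (2, 4, 18)  # release whose changelog lists the PID allocation fix


def parse_release(release):
    """Extract the leading numeric triple from a string like '2.4.18-3smp'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def has_pid_race_fix(release):
    """True if the release is at or past the version named in FIXED_IN."""
    return parse_release(release) >= FIXED_IN


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    status = "ok" if has_pid_race_fix(release) else "may hit the PID race"
    print(release, status)
```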
###@###.### 2002-03-26