|
Quick Lists
|
|
Bug ID:
|
4472904
|
|
Votes
|
188
|
|
Synopsis
|
JDK 1.3.1/1.3 - Deadlock in safepoint code (Java Application hangs)
|
|
Category
|
hotspot:compiler1
|
|
Reported Against
|
1.3.1
, merlin-beta2
|
|
Release Fixed
|
|
|
State
|
11-Closed,
Not Reproducible,
bug
|
|
Priority:
|
4-Low
|
|
Related Bugs
|
4513101
,
4641232
,
4681171
|
|
Submit Date
|
21-JUN-2001
|
|
Description
|
Customer Problem Description:
------------------------------
After 3+ hours of execution, our Java application's VM is hanging. We cannot even signal the VM to do a thread dump. We are able to consistently reproduce this problem by running our application for an extended period of time under a considerable load. This problem is detrimental to the stability of our application and is a top priority for us.
To diagnose the problem, we downloaded to JDK 1.3.0 source and debugged it. The problem is a deadlock in the safepoint synchronization code. At the time of the deadlock, SafepointSynchronize::_waiting_to_block has a value greater than zero, while all threads appear to be at their respective safepoints.
As you know, the compiled safepoint handler copies the currently executing method and replaces almost all calls out of it with illegal instructions, which are later trapped to stop the thread. If the compiled code makes a JVM call, the thread's state changes from _thread_in_native_trans to _thread_in_native_blocked. If the thread's state is _thread_in_native_blocked when SafepointSynchronize::examine_state_of_thread() is called, roll_forward(_at_safepoint) is called. This causes the main thread to decrement _waiting_to_block on behalf of the running thread, because it is assumed to already be in a safepoint. However, in our case, the initial state of the thread is _thread_in_native. The main thread expects the running thread to decrement _waiting_to_block once it enters a safepoint. But, once our thread reaches the safepoint, it is in _thread_in_native_blocked. In this case SafepointSynchronize::block() does not decrement _waiting_to_block. Therefore, the main thread blocks indefinately waiting for _waiting_to_block to decrement to 0.
The application consists of four java processes:
1. A 100% Java home grown message server. This server communicates with
the other components using TCP connections. This process does NOT
experience the deadlock problem.
2. An application server process. This consists of 47 threads. Each
thread is a subcomponent of the system. Each subcomponent requests data
objects from the message server, performs an action on the data, and
outputs processed data objects into the message server. The
subcomponents form a 6 stage pipeline in which work is performed. Some
stages have multiple threads, and there are some application management
threads, so that's how we wind up with 47 of them. This process DOES
experience the deadlock and does so more frequently than #3.
3. A cloudscape database server process. This process DOES experience
the deadlock.
4. A tomcat web server process. This process DOES NOT experience the
deadlock.
We have tested the application against 1.3.1 and we do experience the
deadlock there as well.
Here's some text/explainations I've been putting together for my own sanity. I've included some trace diagnostics. You'll notice some fields that don't look familiar. The most useful of these is _decremented_block.
This is a per-thread variable that gets set whenever a thread decrements
_waiting_to_block. SafepointSynchronize::begin() sets this value to 0
for all threads before the still_waiting loop. If the _waiting_to_block
is decremented on behalf of a thread by roll_forward(_at_safepoint),
_decremented_block is set to 2. If the decrement is done by
SafepointSynchronize::block(), _decremented_block is set to 1. If
_decremented_block_ is 0, the decrement has not been done.
-- attached diagnostic data --
Situation : Compiled code handler
Hang occurred after 3.5 hours of operation running Emissary. Trace log
had the following status repeated:
Threads_lock timeout: 0x1a24a870 [0x42c] _waiting_to_block=1
block_signaled=1 is_synchronizing=1 count=1
THE HANG occurred
num_threads = 30 init_waiting=30 stage2_wtb=2 _waiting_to_block=1
Thread: 0x1a1d42a0 [0x408] State: _compiled_handler ThreadState: 50
pc: 0x98d7e3 _decremented_block=0
SafepointHandler: CompiledCodeSafepointHandler
Thread: 0x1a1e2898 [0x3f8] State: _at_safepoint ThreadState: 11 pc:
0x0 _decremented_block=2
Thread: 0x1a1cbc58 [0x728] State: _at_safepoint ThreadState: 4 pc: 0x0
_decremented_block=2
Thread: 0x1a1c6420 [0x460] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x78b488 [0x868] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x78a818 [0x504] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x1a1aec80 [0x718] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a1adea8 [0x970] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a1ad3d8 [0x63c] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a1ac990 [0x208] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a1bc4b8 [0x210] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a1ebca0 [0x588] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x8312c8 [0x628] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x1a24a590 [0x498] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a24a408 [0x6ec] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x786a60 [0x68c] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x7868f0 [0x540] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x1a24a870 [0x42c] State: _call_back ThreadState: 7 pc: 0x0
_decremented_block=1
Thread: 0x7888a8 [0x3b4] State: _at_safepoint ThreadState: 50 pc: 0x0
_decremented_block=2
Thread: 0x786df8 [0x450] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x1afb48a8 [0x6b0] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x787618 [0x5c8] State: _at_safepoint ThreadState: 11 pc: 0x0
_decremented_block=2
Thread: 0x835da0 [0x704] State: _at_safepoint ThreadState: 11 pc: 0x0
_decremented_block=2
Thread: 0x1a1a3710 [0x880] State: _at_safepoint ThreadState: 50 pc:
0x0 _decremented_block=2
Thread: 0x1a16ed70 [0x3a8] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x1a2380b8 [0x1f8] State: _at_safepoint ThreadState: 11 pc:
0x0 _decremented_block=2
Thread: 0x8122f0 [0x1e4] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x80f920 [0x144] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread: 0x1a0b0c60 [0x720] State: _at_safepoint ThreadState: 10 pc:
0x0 _decremented_block=2
Thread: 0x7844c0 [0x590] State: _at_safepoint ThreadState: 10 pc: 0x0
_decremented_block=2
Thread 0x42c printed out a timeout message, as part of fix 1. It
appears to be unrelated, since thread 0x408 is the one that did not
decrement waiting_to_block.
Thread 0x408 has the following stack dump:
Reported by Debugger Interpretation
NTDLL! 77f827e8()
KERNEL32! 77e86a3d()
JVM! 08047031() SafepointSynchronize::block()safepoint.cpp line 322
JVM! 0802713a() JNI_GetPrimitiveArrayCriticaljni.cpp line 1451
Java_java_util_zip_CRC32_updateBytes(const JNINativeInterface_ * *
0x1a1d42f4, _jobject * 0x1cfafa6c, long 0, _jobject * 0x1cfafa80, long
0, long 4088) line 38 + 18 bytes
0x408's current state is 50 (_thread_in_native_blocked). The particular
section of SafepointSynchronize:block() in which the thread is blocked
is structured as follows:
switch (state)
{
// Code deleted
case _thread_in_native_trans:
case _thread_blocked_trans:
case _thread_new_trans:
// Code and comments deleted
if (state == _thread_in_native_trans)
thread->set_thread_state(_thread_in_native_blocked);
// Block until the safepoint operation is completed.
Threads_lock->lock_without_safepoint_check();
The thread's state could not have been _thread_in_native_blocked when
the function was entered, since it would not have made it into this
section of the switch statement. Furthermore, the only way it could
have changed to _thread_in_native_blocked, would have been if its value
was _thread_in_native_trans. This state is consistent with the
transition code on the stack below it. Therefore, this thread's state
must have been _thread_in_native_trans when the function was entered.
The thread's safepoint state is recorded as _compiled_handler. This
means that its state was _thread_in_comp_leaf or _thread_in_Java when
examine_thread_state() was called. The _compiled_handler state expects
the thread to decrement _waiting_for_block. However, when
SafepointSynchronize::block() is entered, the state causes a jump to a
block of code that is normally called when the examine_thread_state()
has already called roll_forward(_at_safepoint). Since roll_forward()
automatically decrements _waiting_to_block when the safepoint state is
set to _at_safepoint, block() does not decrement it again.
decremented_block is 0 meaning none has decremented _waiting_for_block
on its behalf.
-- end of diagnostic data --
|
|
Work Around
|
Turning off compiled safepoints(-XX:-UseCompilerSafepoints ) seems to avoid dea
dlocks. However, we are having other stability problems (access
violations and internal HotSpot errors). I am attaching more details(
Dr Watsonon log).
|
|
Evaluation
|
This deadlock does exist. A fix has been coded. Need to find a way to
duplicate the original problem to verify that the fix works.
xxxxx@xxxxx
The above evaluation turned out to be wrong. The problem turns out to be
a code generation issue with c1 that lets a thread escape from the method
we've installed a compiled code safepoint buffer for. Here is an analysis
of a failure seen on sparc. The thread in question has this partial
stack trace:
=>[1] ___lwp_mutex_lock()
[2] os::Solaris::mutex_lock()
[3] os::Solaris::Event::lock()
[4] Mutex::wait_for_lock_implementation()
[5] Mutex::lock_without_safepoint_check()
[6] SafepointSynchronize::block()
[7] ThreadStateTransition::transition()
[8] ThreadStateTransition::trans()
[9] ThreadInVMfromNative::ThreadInVMfromNative()
[10] JVM_ArrayCopy()
[11] 0xf9c0ec7c()
[12] 0xf9cf2e74()
[13] 0xf9c06570()
[14] 0xf9c06570()
...
So the thread has blocked itself but runtime was expecting it to stop via
compiled code safepoint but instead it stopped during a transition. Since it
came from native the runtime assumes it was already counted as blocked and
so it doesn't decrement waiting_to_block and we hang. An analysis of the
JavaThread state and other structures shows that the compiled code safepoint
was put in place for code executing at what is now frame 12. Here is a
snippet of the code surrounding where it was executing at the time.
...
0xf9cf2e68: mov %l3, %o2
0xf9cf2e6c: mov %l4, %o3
0xf9cf2e70: mov %l2, %o4
0xf9cf2e74: call 0xf9cf2e84 /* patched */ RETURN ADDRESS IS HERE!
0xf9cf2e78: nop
0xf9cf2e7c: ba,pt %icc,0xf9cf2db0
0xf9cf2e80: nop
0xf9cf2e84: sethi %hi(0xf5813c00), % PC in ThreadSafepointState
0xf9cf2e88: add %g5, 0x1d0, %g5 NPC in the ThreadSafepointState
0xf9cf2e8c: sethi %hi(0xf9ca1c00), %g3
0xf9cf2e90: jmp %g3 + 0x38 RUNTIME STUB (0xf9ca1940-2654) (prepare_interpreter_call...)
0xf9cf2e94: nop
...
According to the safepoint state we installed the safepoint when the thread was
executing at 0xf9cf2e84. At this point we have already executed the call at
0xf9cf2e74 (that's how we got to this PC) and that is the return address we
see for frame 12. We allocate the new bufferblob and copy the code and patch it.
We patch the call corresponding to call at 0xf9cf2e74 (but we've already passed
it) and we repostion the thread to the code that corresponds to 0xf9cf2e84 and
let it go. It immediately jumps to the runtime stub via the jump corrsponding to
0xf9cf2e90 and we've escaped. Since the runtime only sets traps at places the
compiler told it to do this is a compiler bug. Re-assigning it to c1.
I've attached two files which shows the state of the original method and
the patched method. The reason is that at first glance there are other places
that are not patched where a thread might escape (possibly if we were about
to throw an index range exception?). This might not really be the case depending
on the code flow but since I've got the data I've attached it.
xxxxx@xxxxx 2002-02-20
What steve found is actually not this bug, it's the bug that was marked as a duplicate of this, 4513101. I've created a new bug 4641232 to track the problem in the duplicate.
xxxxx@xxxxx 2002-02-21
We haven't heard back from the Sun contact for the customer about getting a test
case in house for this bug. The only known similar bug, 4641232, was introduced
in 1.4 and was only on SPARC, so is definitely not this problem.
xxxxx@xxxxx added some verification code to 1.3.1 to look for this
case and ran some tests internally, but nothing showed up. Marking the bug as
incomplete.
xxxxx@xxxxx 2002-03-05
This bug has been in the incomplete state for nearly a month. As the only known
problem has been fixed (see above), closing it as "not reproducible".
xxxxx@xxxxx 2002-04-03
|
|
Comments
|
Submitted On 04-SEP-2001
mkenig
What is the release schedule for this bug fix? We are experiencing the exact same problem.
Can we get the fix in advance?
Submitted On 06-SEP-2001
charlesviles
This is a critical issue for us as well. When will a fix or
upgrade be available? We tried with and without
UseCompilerSafeoints in both -client and -server and
experienced JVM level hangs.
Submitted On 07-SEP-2001
mkenig
This parallels our experience. The workaround isn't.
Submitted On 07-SEP-2001
sareddy
we are experiencing the same issue. Ours is a jsp/database
intensive application running on Apache/JServ. We find it
impossible to keep our application running for more than a
day with this problem. This is causing serious delay in
successful deployment of our production system.
Submitted On 16-OCT-2001
crichtn
I believe we are suffering a similar problem with our production server. Please, please fix this problem
quickly....
Submitted On 09-NOV-2001
Dark_Archon
We are also experiencing similar problem. Ours is also a
JSP/Database intensive application running on Sun/Iplanet.
We have been tracing this problem for days. Please release
a fix as soon as possible.
Submitted On 03-DEC-2001
karsh9
Out product too suffers from a similiar problem, and its a
very critical issue for us. When will a patch be available
for 1.3 ?
Submitted On 11-FEB-2002
cotterp
We experience possibly the same problem running with an
application running under Weblogic 5.1 using 1.3.0 or
1.3.1. It just takes longer to happen on 1.3.1. If we run a
Truss -P on the jre PID, the lwp's are sleeping waiting for
a processes to return. We can reproduce this at will
(usually under load) but have not been able to reproduce
(at least not yet) when running the app server on NT. This
caused us to move our app to NT while we figure out what is
going on. Our app is also much faster on NT too.
Submitted On 19-MAR-2002
jokolo
We are also experiencing the same problems. Our env is
Weblogic 5.1 with jdk1.3.1. We are curently using the
XX:+UseBoundThreads option and it seems to be working,
however we are not yet throwing any load at the application
and this is giving us much concern.
Submitted On 19-APR-2002
janicki
Our similar problem only occurred on Sun multi-processor machines. Single CPUs seemed more stable (although tested under lighter loads). Can anyone confirm this multi-processor clue?
OUR WORKAROUND: Java 1.2.2.5 works! (earlier and later versions don't)
Submitted On 07-MAY-2002
prydin
We're experiencing the same problem in a heavily threaded
application under heavy load. This problem prevents us from
deploying our system!
PLEASE NOTE: JDK6 is formerly known as Project Mustang
|
|
|
 |