EVALUATION
The root cause of the crashes and assertion failures is a preexisting
bug in the safepoint code which was made more frequent by the
introduction of biased locking and its increase in the number of
safepoints taken.
During a thread state transition from _thread_in_native to
_thread_in_vm a safepoint is taken for a bias revocation requested by
another thread. The write to the memory serialization page, used to
ensure total store ordering for thread states during safepoint checks,
faults. Ordinarily the structured exception handler (SEH) installed by
HotSpot on Windows would be invoked to examine this fault and continue
execution, knowing that the serialization page will be write-enabled
again shortly by the VM thread. However, the SEH is only guaranteed to
be installed in the call_stub, or in other words, when Java frames are
on the stack. In the case of the Java Plug-In, the JVM is created from
native code and the main thread repeatedly invokes JNI functions
without having any Java frames on the stack. When the fault is taken
it looks like an unhandled exception in IEXPLORE.EXE and the browser
crashes.
The above behavior happens with JPICOM.DLL which is apparently the COM
version of the Java Plug-In used for Internet Explorer. Some of the
plugin tests seem to use JPIEXP.DLL. This plugin DLL seems to install
a top-level exception handler which catches the unhandled exception
above and swallows it, so it doesn't look at that point like the JVM
has crashed. It somehow seems to truncate the stack of that thread
back down to the installation point of the SEH in the plugin, leaving
the JVM state associated with that thread inconsistent (in particular,
the thread is still in the transition state which incurred the
serialization page fault). Upon leaving the web page, the plugin
attempts to make a call into the JVM via JNI with this transition
state in the JavaThread, and we hit the assertion seen before.
The fix (suggested by ###@###.###) is to switch back to using
membars only on Windows for certain JVM state transitions which are at
risk, in particular _thread_in_native to _thread_in_vm and vice versa,
and from _thread_in_vm to _thread_blocked and vice versa, when there
is no Java call stub on the stack (and consequently no last_Java_frame,
since having one means we must have transitioned from _thread_in_Java
to get into _thread_in_native). This is the least-impact solution. We may later
consider changing to use vectored exception handlers for recent
Windows releases (XP and later), which would gain back any performance
loss from this change and also significantly improve the speed of the
fast JNI GetField accessors developed by ###@###.###.
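
The choice described above can be illustrated with a minimal, portable
sketch. This is not the HotSpot code: SketchThread, SketchState,
publish_state, transition and the counters are hypothetical names
invented for illustration, the "serialization page" is an ordinary
variable that never faults, and the check on has_last_java_frame is an
assumed stand-in for "a call stub, and therefore the SEH, is on the
stack".

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, simplified stand-ins for HotSpot's thread states.
enum SketchState { IN_NATIVE, IN_NATIVE_TRANS, IN_VM, IN_VM_TRANS, BLOCKED };

// Stand-in for the memory serialization page. The real page can be
// write-protected by the VM thread, which is what makes the write fault;
// this plain variable never does.
static volatile int sketch_serialization_page = 0;

struct SketchThread {
    std::atomic<int> state{IN_NATIVE};
    // Assumption: true only when a call stub (and its SEH) is on the stack.
    bool has_last_java_frame = false;
    // Bookkeeping for the sketch only, so the chosen path can be observed.
    int membars = 0;
    int page_writes = 0;
};

// Make the preceding state store visible before the safepoint check.
inline void publish_state(SketchThread& t) {
    if (!t.has_last_java_frame) {
        // No SEH is guaranteed here: use a full fence, which cannot fault.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        ++t.membars;
    } else {
        // SEH is installed: the possibly faulting page write is acceptable.
        sketch_serialization_page = 1;
        ++t.page_writes;
    }
}

// Move a thread through a transitional state into its target state.
inline void transition(SketchThread& t, SketchState via, SketchState to) {
    t.state.store(via, std::memory_order_relaxed);
    publish_state(t);
    // A real VM would check for a pending safepoint here and block if needed.
    t.state.store(to, std::memory_order_relaxed);
}
```

In this sketch, a thread created from native code (no Java frames, as
with the plugin's main thread) calling
transition(t, IN_NATIVE_TRANS, IN_VM) takes the fence path, while a
thread with has_last_java_frame set takes the page-write path.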
I also disabled the installation of asynchronous exceptions in the
return path from ThreadToNativeFromVM (used primarily in the boot
class loader) because this could be problematic and lead to assertion
failures or crashes.
This fix was verified manually with the Java Plug-In tests by opening
up the window of potential faults significantly (putting a sleep in
os::serialize_thread_states between the protect and unguard) and
running many tests. Before the fix, with this window expanded the
browser would crash almost immediately; with the fix, it no longer
does. Performance of the fix was verified with the JNI microbenchmarks
by ###@###.###.
This problem is present in the 5.0 release train so this fix will need
to be backported.
|
EVALUATION
The problem is that the JNI thread has the wrong state _thread_in_native.
It's not clear at this time how it got into that state.
It appears that the problem was introduced with the biased locking putback, CR 6264252.
A workaround is to pass -XX:-UseBiasedLocking to the VM, which makes the problem go away.
After testing with 20050804133224.kbr.c2_baseline the problem has disappeared. It's quite
possible that this may have moved the problem around, as the fixes for CR 6306530 and 6295591
do relate directly to the original symptom, which is the thread being in the wrong state.
Since there is a workaround and a "fix" will be provided indirectly in b47, I am reducing the priority of this
bug to P3.
|