EVALUATION
To the SDN user who asked which versions this is fixed in:
----------------------------------------------------------
This is a bug in at least the 1.3.1, 1.4.2, and 5.0 trains,
and is fixed in the 1.3.1_17, 1.4.2_11, and 5.0u7 releases of those
trains. It is also fixed in Mustang (6.0) beta b37. This information
is also visible in the "Fixed in" field of the bug report.
SUGGESTED FIX
See http://analemma.sfbay/net/spot/scratch/ysr/gclocker/webrev
###@###.### 2005-04-18 21:06:42 GMT
Event: putback-to
Parent workspace: /net/jano.sfbay/export/disk05/hotspot/ws/main/gc_baseline
(jano.sfbay:/export/disk05/hotspot/ws/main/gc_baseline)
Child workspace: /net/prt-web.sfbay/prt-workspaces/20050420100316.ysr.gclocker/workspace
(prt-web:/net/prt-web.sfbay/prt-workspaces/20050420100316.ysr.gclocker/workspace)
User: ysr
Comment:
---------------------------------------------------------
Original workspace: neeraja:/net/spot/scratch/ysr/gclocker
Submitter: ysr
Archived data: /net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2005/20050420100316.ysr.gclocker/
Webrev: http://analemma.sfbay.sun.com/net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2005/20050420100316.ysr.gclocker/workspace/webrevs/webrev-2005.04.20/index.html
Fixed 6186200: RFE: Stall allocation requests while heap is full and GC locker is held
http://analemma.sfbay/net/spot/scratch/ysr/gclocker/webrev
The problem was that if the heap is getting close to full
and a thread enters a JNI critical section, then an
allocation request that exceeds the available space
will fail because GC is not allowed and the request
cannot be immediately satisfied. This means that
applications that use JNI critical sections, even for
pretty short durations, might be susceptible to strange
and fleeting OOM errors, provided the critical section is
entered at the right time. Several such reports have recently
surfaced in the field (Wachovia, Instinet, SAP, SCT, etc.),
some of them resulting in escalations.
Our fix is to stall such requests until the critical
section has cleared, making a GC possible. For defensive
reasons, if the allocating thread is itself in the critical
section, we do not stall. This avoids self-deadlocks, but
of course does not rule out deadlock possibilities because
of transitive dependencies, not directly or easily visible
or inferable by the VM, from a thread in a JNI critical
section to an allocating thread thus stalled. Clearly,
such dependencies of threads in JNI critical sections
violate the conditions documented in JNI_Get*Critical().
It could be argued that the reflexive dependency is also
such a violation and need not be checked. That is certainly
a reasonable stance; we now flag such reflexive deadlock
possibilities under the -Xcheck:jni flag. I chose
to be liberal in this check since it involves use of
state that is already available in the allocating thread.
(I am easily persuaded not to make a concession for such
self-deadlocks. After all, it could be argued, it's a
good thing not to encourage users to write bad code,
indeed code that violates documented restrictions.)
JVMPI offers an API for disabling GC, which can of course
be used to cause further deadlocks under such circumstances.
The GC locker is also used in certain JVMPI functions to
prevent GC while events are posted. There are
code paths in the JVM where threads may suspend while holding
the GC locker lock, in response to JVMPI/JVMDI suspension.
These introduce further situations where the application can
deadlock. In all these cases, we will have replaced a potential
OOM error with a new deadlock. However, JVMPI is going away
in Mustang, so we need not worry about these new deadlock
scenarios. In any event, there are deadlocking modes
possible with these and other JVMPI interfaces even without
the use of GC locker. So I do not believe this is a
major issue, although it should be run by JVMPI users
(tool vendors) if and when the fix is back-ported to Tiger.
This change request is currently before the CCC.
We are putting this fix back so as to
allow nightly GC testing before the next integration.
If the CCC recommends modifications, we'll do so
in a future putback as necessary.
A further GC-locker-related fix and a code clean-up
are forthcoming in another putback later today under bug id
4828899.
Reviewed by: Alan Bateman, Paul Hohensee
Fix verified: yes
Verification test: A modified version of Mingayo's JNI locker
test run with a small heap so as to increase the
cross-section of the described window of vulnerability
Other testing:
PRT
big apps testing (36 hours with all collectors; thanks LiFeng/June)
runThese -full
refworkload
Files:
update: src/share/vm/gc_implementation/parallelScavenge/parallelScavengeHeap.cpp
update: src/share/vm/memory/collectorPolicy.cpp
update: src/share/vm/memory/gcLocker.cpp
update: src/share/vm/memory/gcLocker.hpp
Examined files: 3240
Contents Summary:
4 update
3236 no action (unchanged)
###@###.### 2005-04-20 21:31:58 GMT