Do not use -XX:+ParallelRefProcEnabled, or explicitly switch it off
with -XX:-ParallelRefProcEnabled; see the evaluation/comments section.
Watch this space for diffs and a regression test case.
In light of this, one will probably also need to run some of the older
performance tests anew to assess the efficacy of the parallel reference
processing code.
Incomplete marking during parallel work queue overflow (overflow
list was being ignored) during parallel reference processing
(marking) phase. Simple fix, will need to be verified by customer.
A "day one" bug in the handling of the overflow list used in
parallel rescan and in parallel reference processing
has been found and fixed. This fix applies to CMS with
parallel remark, even in the absence of parallel reference
processing, and needs to be backported to 6.0, 5.0 and
1.4.2_XX (XX >= 14) as well.
This bug is fixed (see suggested fix section). In the course of
stress testing related to this bug, a new as-yet-undiagnosed
bug came to light. That's being tracked under 6578335.
This bug fix should be backported to 6, 5 and 1.4.2; see sub-CRs.
From "Y. S. Ramakrishna" <###@###.###>
Sent Thursday, July 12, 2007 11:36 am
Subject Code Manager notification (putback-to)
Parent workspace: /net/jano2.sfbay/export2/hotspot/ws/main/gc_baseline
Child workspace: /net/prt-web.sfbay/prt-workspaces/20070712093851.ysr.mustang/workspace
Job ID: 20070712093851.ysr.mustang
Original workspace: neeraja:/net/jano2.sfbay/export2/hotspot/users/ysr/mustang
Archived data: /net/prt-archiver.sfbay/data/archived_workspaces/2007/20070712093851.ysr.mustang/
Fixed 6558100: CMS crash when -XX:+ParallelRefProcEnabled is set
Partial 6572569: CMS: consistently skewed work distribution indicated in (long) re-mark pauses
When CMS marking (either during parallel rescan or parallel reference processing)
runs out of space on the per-worker work queues, the overflown grey objects
are tracked by chaining through their mark word. In this case, we had two
bugs: firstly, the method that took a prefix of the overflow list was not
re-attaching the intended suffix correctly (this affects all JVMs going
back to 1.4.2_14); secondly, the parallel reference processing code was
entirely neglecting to process the overflow list (this affects JVMs going
back to 5.0). The crucial debugging breakthrough came when Poonam used
the SA to track down the objects that CMS remark was declaring as
unreachable but unmarked, and found that they occurred in long chains
linked via their mark word (but with the promoted bit not set, which
helped distinguish them from the promoted chains that ParNew uses, and
identified them as broken fragments of an erstwhile overflow list).
Many thanks to Poonam Bajaj and Thomas Viessmann for crucial
debugging help. The customer has since run with a version of 6u2
with the fix (thanks Poonam) and verified that the previous crash
does not reproduce in > 2 days (previously the crash would happen in
about 4 hours).
Some debugging code was added, and some asserts were relaxed
to allow for the possibility of examining an object lying at the end
of the overflow list. This latter issue will be revisited more
thoroughly and cleaned up under a separate bug id.
When CMSScavengeBeforeRemark is set, we were assuming that a scavenge
would have necessarily preceded a remark and that therefore the heap
would already be in a parsable state. However, it is possible that
the scavenge may not have been done because, for instance, a JNI
critical section was held. The main CR here will need further work to
deal with the issue found at the customer's site, but this fix for
the problem with CMSScavengeBeforeRemark serves as a temporary workaround
for the customer's performance issue described in the bug report.
Thanks to Chris Phillips for testing and backport help with 5uXX where
the problem manifested most readily.
Reviewed by: Jon Masamitsu & Andrey Petrusenko
Fix Verified: y
6558100: GCBasher on CMS with CMSMarkStackOverflowALot enabled
6572569: GCBasher on CMS with CMSScavengeBeforeRemark & no survivor spaces
PRT (also with CMS stress options)
refworkload, runThese -quick and -testbase
Note added in proof: some late-breaking testing of big apps using the
stress flags yesterday revealed an as-yet-undiagnosed issue when
running Tomcat and ATG. Thanks to Ashwin for finding this issue,
which is being tracked under CR 6578335.
Examined files: 3991
3988 no action (unchanged)
This CR covers two bugs: one in parallel reference processing
(in 7, 6 and 5), for which the workaround is -XX:-ParallelRefProcEnabled;
and another in parallel remark (in 7, 6, 5 and 1.4.2_14+), for
which the workaround is -XX:-CMSParallelRemarkEnabled.
Note that -XX:-ParallelRefProcEnabled is in fact the default,
while -XX:+CMSParallelRemarkEnabled is the default.
Turning off parallelism in either case can, however, adversely
affect CMS remark pause times.
6558100 happens when there is task queue overflow in CMS parallel remark.
In this case the overflowed objects are chained via their
mark words. The bug is that later we forget to process
these overflowed objects. In effect, the reachability
closure is not computed past these objects. So any
objects that are reached only through these overflown objects
will not be marked and will be collected. These collected
objects will end up on the CMS free list. When the next
collection starts, the marking phase might reach these
now-freed objects, and the marker barfs trying to scan
these free blocks as though they were objects.
The symptom of the collector thread barfing while marking
could potentially come from a host of possible root causes
in the VM, including missing card-marks, a bug in the CMS
precleaning or, in this case, a bug in the CMS remark (or
parallel reference processing).
Identifying an instance of 6263371 as a duplicate of 6558100:
The key to the identification is that although the overflown
objects were not scanned (so the objects that they point to
were collected prematurely), the overflown objects themselves
were not collected, because they had been marked before being
placed on the overflow list. The reason the marker barfs is that
it is trying to scan one of these prematurely collected objects,
which is referenced by some field of an overflown object.
Connect to the core file using the SA, give it the
address of the oop that we were trying to scan when we segv'd,
and ask it to find all locations in the heap that contain that
address. There are likely to be only a few (for reasons that
will become clear below). Look at each of these locations
in turn. Each will be a normal object, except that it will
have a strange looking mark word. The mark word will have the
address of another object, which will likewise be a normal
looking object with a strange looking mark word and so on.
You have just found part of the overflow list from the
previous remark, and have identified an instance of 6558100.
The above is just one symptom of 6558100. There are likely to
be others. In particular, since the mark word has, in effect, been
clobbered and might have contained locking information or an identity
hash code, any computation on an object that involves synchronization
or hash-code use could return the wrong answer. Possibilities
include IllegalMonitorStateException, perhaps biased-locking
malfunction (although I have not worked this through in detail),
or other such weird behaviour.