|
Description
|
This report describes a bug, and a suggested fix, that a licensee found in the CMSCleanOnEnter optimization in 6.0. The SIGSEGV is shown below along with the licensee's analysis and suggested fix.
The original issue was a SIGSEGV at:

#9 0x60000000c87233b0:0 in MarkRefsIntoAndScanClosure::do_oop ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/concurrentMarkSweepGeneration.cpp:6432
#10 0x60000000c8778b80:0 in objArrayKlass::oop_oop_iterate_nv_m ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/genOopClosures.hpp:391
#11 0x60000000c876f810:0 in ScanMarkedObjectsAgainCarefullyClosure::do_object_careful_m ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/memRegion.hpp:31
#12 0x60000000c8d791b0:0 in CompactibleFreeListSpace::object_iterate_careful_m ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/compactibleFreeListSpace.cpp:788
#13 0x60000000c87714a0:0 in CMSCollector::preclean_card_table ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/concurrentMarkSweepGeneration.cpp:4625
#14 0x60000000c876e710:0 in CMSCollector::preclean_work ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/concurrentMarkSweepGeneration.cpp:4406
#15 0x60000000c8774510:0 in CMSCollector::abortable_preclean ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/concurrentMarkSweepGeneration.cpp:4212
#16 0x60000000c8768b80:0 in CMSCollector::collect_in_background ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/concurrentMarkSweepGeneration.cpp:2262
#17 0x60000000c92492e0:0 in ConcurrentMarkSweepThread::run ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/runtime/concurrentMarkSweepThread.cpp:104

After analyzing the core file, we found that the crash occurred because the CMS GC was trying to mark an object that appears to be located within another object (an array), but the _klass word of this object is either 0 or some value that does not point to the perm gen. It appears that the content of the object has been overwritten by the content of the enclosing array.

Unfortunately, we do not have a test case for it.
We have a theory of what is causing the CMS crash.
The problem lies in the CMSCleanOnEnter optimization that's in 6.0.

void MarkFromRootsClosure::scanOopsInOop(HeapWord* ptr) {
  ......
  if (CMSCleanOnEnter && (_finger > _threshold)) {
    HeapWord* old_threshold = _threshold;
    _threshold = (HeapWord*)round_to((intptr_t)_finger, CardTableModRefBS::card_size);
    MemRegion mr(old_threshold, _threshold);
    _mut->clearRange(mr);
  }
  ......
}

Meanwhile, an object could be concurrently promoted into the CMS gen at a location that falls within the range cleared above.

void CMSCollector::promoted(bool par, HeapWord* start,
                            bool is_obj_array, size_t obj_size) {
  if (_collectorState >= Marking) {
    _markBitMap.mark(start);
    if (_collectorState < Sweeping) {
      _modUnionTable.mark(start);
    }
  }
}

In MarkFromRootsClosure::scanOopsInOop(), if thisOop <= _threshold, then the memory region cleared by the _mut->clearRange(mr) call can be divided into:
[_threshold, _finger), [_finger, round_to(_finger, card_size))
which translates into a range that's within the current object, and one that's after the current object.

If thisOop > _threshold, then the memory region can be divided into:
[_threshold, thisOop), [thisOop, _finger), [_finger, round_to(_finger, card_size))
which translates into ranges that are before, within, and after the current object.

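The two cases above can be sketched as a small stand-alone function. This is only an illustration with plain integer addresses, not HotSpot code: kCardSize, round_up_to_card, SubRange and split_cleared_region are hypothetical names, and 512 is an assumed card size. Empty sub-ranges (for example when _finger is already card-aligned) are omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for CardTableModRefBS::card_size (assumed value).
constexpr std::uintptr_t kCardSize = 512;

// Round addr up to the next multiple of kCardSize, like round_to() above.
constexpr std::uintptr_t round_up_to_card(std::uintptr_t addr) {
    return (addr + kCardSize - 1) / kCardSize * kCardSize;
}

struct SubRange {
    std::string where;     // "before", "within", or "after" the current object
    std::uintptr_t start;  // inclusive
    std::uintptr_t end;    // exclusive
};

// Split the cleared region [threshold, round_up_to_card(finger)) relative
// to the current object, modelled as occupying [this_oop, finger).
std::vector<SubRange> split_cleared_region(std::uintptr_t threshold,
                                           std::uintptr_t this_oop,
                                           std::uintptr_t finger) {
    assert(this_oop < finger && threshold < finger);
    const std::uintptr_t new_threshold = round_up_to_card(finger);
    std::vector<SubRange> parts;
    if (this_oop > threshold) {
        // Second case: there is a gap in front of the object.
        parts.push_back({"before", threshold, this_oop});
        parts.push_back({"within", this_oop, finger});
    } else {
        // First case: the object starts at or below the threshold.
        parts.push_back({"within", threshold, finger});
    }
    if (finger < new_threshold) {
        parts.push_back({"after", finger, new_threshold});
    }
    return parts;
}
```

For example, split_cleared_region(1024, 1060, 1100) yields a "before" range [1024, 1060), a "within" range [1060, 1100) and an "after" range [1100, 1536).
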
For the memory range that's within the current object, it is safe to clear the MUT (mod union table) because the object is going to be scanned later in the function.

For the memory range that's after the current object, it is also safe to clear the MUT because if there is a concurrent promotion, the promoted() function also marks the corresponding bit in the marking bitmap. So when the current marking phase scans that bit in the bitmap later, it will scan that object anyway.

For the memory range that's before the current object, I suspect that it might not be safe to clear the MUT. This range was unmarked when scanning the marking bitmap. However, if an object was concurrently promoted into this range between the time the previous and the current bit is scanned in the bitmap, and if it is cleared from the MUT now, then this promoted object will not be scanned later, and hence objects referenced from this promoted object will be garbage collected by mistake.

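The suspected scenario can be played through in a toy model. This is a sketch of the hypothesis only, under strong simplifying assumptions: the bitmap and the MUT are modelled as sets of word addresses rather than bitmaps with card granularity, the promotion is assumed to happen after the bitmap walk has passed that address, and ToyMarker and promoted_object_is_missed are hypothetical names. It does not show that the race occurs in the VM.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

struct ToyMarker {
    std::set<std::uintptr_t> mark_bits;  // stands in for _markBitMap
    std::set<std::uintptr_t> mut;        // stands in for _modUnionTable
    std::set<std::uintptr_t> scanned;    // objects whose fields were scanned

    // Models CMSCollector::promoted() while marking: mark in both tables.
    void promoted(std::uintptr_t start) {
        mark_bits.insert(start);
        mut.insert(start);
    }

    // Models _mut->clearRange(mr): drop MUT marks in [from, to).
    void clear_mut(std::uintptr_t from, std::uintptr_t to) {
        mut.erase(mut.lower_bound(from), mut.lower_bound(to));
    }

    // Models the bitmap walk continuing from the current object onward.
    void scan_bitmap_from(std::uintptr_t from) {
        for (auto it = mark_bits.lower_bound(from); it != mark_bits.end(); ++it) {
            scanned.insert(*it);
        }
    }

    // Models a later rescan driven by whatever is still marked in the MUT.
    void rescan_from_mut() { scanned.insert(mut.begin(), mut.end()); }
};

// Returns true if an object promoted at promoted_at is never scanned when
// the whole region [threshold, new_threshold) is cleared from the MUT.
bool promoted_object_is_missed(std::uintptr_t threshold,
                               std::uintptr_t this_oop,
                               std::uintptr_t new_threshold,
                               std::uintptr_t promoted_at) {
    ToyMarker m;
    m.mark_bits.insert(this_oop);            // the object being scanned now
    m.promoted(promoted_at);                 // concurrent promotion
    m.clear_mut(threshold, new_threshold);   // clearing as in scanOopsInOop()
    m.scan_bitmap_from(this_oop);            // walk proceeds forward only
    m.rescan_from_mut();
    return m.scanned.count(promoted_at) == 0;
}
```

With threshold 1024, current object at 1060 and new threshold 1536, a promotion at 1040 (before the object) is missed in this model, while a promotion at 1200 (after the object) is picked up by the bitmap walk and one at 900 (outside the cleared region) is picked up through the MUT.
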
Posted Date : 2008-09-18 00:34:07.0
|
|
Evaluation
|
Customer is right. The clearing code needs to be fixed so as not to clear
the "vulnerable gap" identified by the submitter. This simple fix will
be made soon; watch the suggested fix section.
Posted Date : 2008-10-23 17:17:07.0
6749695 CMS: SIGEGV thrown on CMSCleanOnEnter optimization
http://bugs.sun.com/view_bug.do?bug_id=6749695
(modulo a bugs.sun.com bug that causes "object" to appear as "customer")
webrev: http://webrev.invokedynamic.info/ysr/6749695/
The problem, as described in the bug report, was that we were
clearing too many cards, namely those prior to the "finger",
which would not be scanned after the clearing of the MUT card,
thus compromising correctness.
The fix is to clear only the portion after the start of the
object. Note that this optimization triggers only when the
marking phase finds objects that straddle (or start at)
page boundaries, limiting its efficacy somewhat.
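As a rough sketch of the change described in this message (the actual patch is in the webrev and is not reproduced here), the cleared region would start no earlier than the start of the current object. The names Region, region_to_clear, round_up_to_card and kCardSize are hypothetical, 512 is an assumed card size, and the sketch works on plain word addresses and ignores how a partially covered first card would be handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for CardTableModRefBS::card_size (assumed value).
constexpr std::uintptr_t kCardSize = 512;

// Round addr up to the next multiple of kCardSize, like round_to().
constexpr std::uintptr_t round_up_to_card(std::uintptr_t addr) {
    return (addr + kCardSize - 1) / kCardSize * kCardSize;
}

struct Region {
    std::uintptr_t start;  // inclusive
    std::uintptr_t end;    // exclusive
};

// Region whose MUT marks would be cleared under the described change:
// it begins at the later of the old threshold and the object's start,
// so the gap in front of the object is left untouched.
Region region_to_clear(std::uintptr_t old_threshold,
                       std::uintptr_t obj_start,
                       std::uintptr_t finger) {
    assert(obj_start < finger && old_threshold < finger);
    const std::uintptr_t new_threshold = round_up_to_card(finger);
    return Region{std::max(old_threshold, obj_start), new_threshold};
}
```

For an old threshold of 1024, an object starting at 1060 and a finger at 1100, this gives [1060, 1536) instead of [1024, 1536), leaving [1024, 1060) marked in the MUT.
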
Testing: jprt with -XX:-UseParNewGC; no test case provided
Thanks for your reviews.
-- ramki
Posted Date : 2008-11-14 20:09:12.0
Upon further review this is not a bug.
Posted Date : 2009-01-13 20:52:10.0
|