Bug Database
Bug ID: 6749695
Votes 0
Synopsis CMS: SIGEGV thrown on CMSCleanOnEnter optimization
Category hotspot:garbage_collector
Reported Against
Release Fixed
State 11-Closed, Not a Defect, bug
Priority: 2-High
Related Bugs
Submit Date 18-SEP-2008
Description
This report outlines a bug and fix that a licensee found in the CMSCleanOnEnter optimization in 6.0. The SIGSEGV stack trace is below, along with the licensee's suggested fix.

The original issue was a SIGSEGV at:

#9  0x60000000c87233b0:0 in MarkRefsIntoAndScanClosure::do_oop ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/concurrentMarkSweepGeneration.cpp:6432
#10 0x60000000c8778b80:0 in objArrayKlass::oop_oop_iterate_nv_m ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/genOopClosures.hpp:391
#11 0x60000000c876f810:0 in ScanMarkedObjectsAgainCarefullyClosure::do_object_careful_m ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/memRegion.hpp:31
#12 0x60000000c8d791b0:0 in CompactibleFreeListSpace::object_iterate_careful_m ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/compactibleFreeListSpace.cpp:788
#13 0x60000000c87714a0:0 in CMSCollector::preclean_card_table ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/concurrentMarkSweepGeneration.cpp:4625
#14 0x60000000c876e710:0 in CMSCollector::preclean_work ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/concurrentMarkSweepGeneration.cpp:4406
#15 0x60000000c8774510:0 in CMSCollector::abortable_preclean ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/concurrentMarkSweepGeneration.cpp:4212
#16 0x60000000c8768b80:0 in CMSCollector::collect_in_background ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/memory/concurrentMarkSweepGeneration.cpp:2262
#17 0x60000000c92492e0:0 in ConcurrentMarkSweepThread::run ()
    at /CLO/Components/JAVA_HOTSPOT/Src/src/share/vm/runtime/concurrentMarkSweepThread.cpp:104

After analyzing the core file, we found that the crash occurred because the CMS GC was trying to mark an object that appears to be located within another object (an array), but the _klass word of this object is either 0 or some value that does not point to the perm gen. It appears that the content of the object has been overwritten by the content of the enclosing array.

Unfortunately, we do not have a test case for it.

We have a theory of what is causing the CMS crash. 

The problem lies in the CMSCleanOnEnter optimization that's in 6.0.

void MarkFromRootsClosure::scanOopsInOop(HeapWord* ptr) {
  ......
  if (CMSCleanOnEnter && (_finger > _threshold)) {
    HeapWord* old_threshold = _threshold;
    _threshold = (HeapWord*)round_to((intptr_t)_finger, CardTableModRefBS::card_size);
    MemRegion mr(old_threshold, _threshold);
    _mut->clearRange(mr);
  }
  ......
}

Meanwhile, an object could be concurrently promoted into the CMS gen, at a location that falls within the range cleared above.

void CMSCollector::promoted(bool par, HeapWord* start,
                            bool is_obj_array, size_t obj_size) {
  if (_collectorState >= Marking) {
    _markBitMap.mark(start);
    if (_collectorState < Sweeping) {
      _modUnionTable.mark(start);
    }
  }
}

In MarkFromRootsClosure::scanOopsInOop(), if thisOop <= _threshold, then the memory region cleared by the _mut->clearRange(mr) call can be divided into:
    [_threshold, _finger), [_finger, round_to(_finger, card_size))
which translates into a range that is within the current object and one that is after the current object.

If thisOop > _threshold, then the memory region can be divided into:
    [_threshold, thisOop), [thisOop, _finger), [_finger, round_to(_finger, card_size))
which translates into ranges that are before, within, and after the current object.

For the memory range that is within the current object, it is safe to clear the MUT because the object is going to be scanned later in the function.

For the memory range that is after the current object, it is also safe to clear the MUT because, if there is a concurrent promotion, the promoted() function also marks the corresponding bit in the marking bitmap. So when the current marking phase reaches that bit in the bitmap later, it will scan that object anyway.

For the memory range that is before the current object, I suspect that it might not be safe to clear the MUT. This range was unmarked when the marking bitmap was scanned. However, if an object was concurrently promoted into this range between the time the previous bit and the current bit were scanned in the bitmap, and if it is cleared from the MUT now, then this promoted object will not be scanned later, and hence objects referenced from this promoted object will be garbage collected by mistake.

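The three-way split described above can be sketched as a small standalone function. This is an illustration only, not HotSpot code: heap addresses are modeled as plain word indices, and Range, ClearedParts, round_up, and split_cleared_region are hypothetical names introduced here. The value card_words is whatever card size (in words) the caller assumes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Half-open range of heap word indices: [start, end).
struct Range {
    std::size_t start;
    std::size_t end;
    bool empty() const { return start >= end; }
};

// The three pieces of the cleared region, relative to the current object.
struct ClearedParts {
    Range before;  // gap preceding the object (the range the submitter suspects)
    Range within;  // part overlapping the object
    Range after;   // from the object's end up to the next card boundary
};

// Round x up to the next multiple of align.
std::size_t round_up(std::size_t x, std::size_t align) {
    assert(align > 0);
    return (x + align - 1) / align * align;
}

// Split [old_threshold, round_up(finger, card_words)) around the object
// occupying [obj_start, finger). All names here are hypothetical.
ClearedParts split_cleared_region(std::size_t old_threshold,
                                  std::size_t obj_start,
                                  std::size_t finger,
                                  std::size_t card_words) {
    assert(obj_start < finger);      // the object is non-empty
    assert(finger > old_threshold);  // the condition guarding the clearing
    const std::size_t new_threshold = round_up(finger, card_words);
    const std::size_t within_start = std::max(old_threshold, obj_start);
    return ClearedParts{
        Range{old_threshold, within_start},  // empty when obj_start <= old_threshold
        Range{within_start, finger},
        Range{finger, new_threshold},
    };
}
```

With an assumed card of 64 words and old_threshold = 128, an object at [100, 150) yields an empty "before" range, while an object at [200, 260) yields the "before" range [128, 200), which is the gap the analysis is concerned with.
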
Posted Date : 2008-09-18 00:34:07.0
Work Around
-XX:-CMSCleanOnEnter

(Note that this is not an issue when running multiple CMS threads, which happens by default when the number of GC threads is >= 4, because the optimization is then turned off by default.)
Evaluation
Customer is right. The clearing code needs to be fixed so as not to clear
the "vulnerable gap" identified by the submitter. This simple fix will
be made soon; watch the suggested fix section.
Posted Date : 2008-10-23 17:17:07.0

6749695 CMS: SIGEGV thrown on CMSCleanOnEnter optimization

   http://bugs.sun.com/view_bug.do?bug_id=6749695
   (modulo a bugs.sun.com bug that causes "object" to appear as "customer")

   webrev: http://webrev.invokedynamic.info/ysr/6749695/

The problem, as described in the bug report, was that we were
clearing too many cards, namely those prior to the "finger",
which would not be scanned after the clearing of the MUT card,
thus compromising correctness.
The fix is to clear only the portion after the start of the
object. Note that this optimization triggers only when the
marking phase finds objects that straddle (or start at)
page boundaries, limiting its efficacy somewhat.
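
The difference between the two clearing strategies can be sketched with a toy model. This is not the actual webrev change, whose contents are not reproduced here; ToyModUnionTable, round_up, and promoted_mark_survives are hypothetical names, the 64-word card is an assumed size, and rounding the object start up to a card boundary is one conservative reading of "clear only the portion after the start of the object".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for the mod union table (MUT): one dirty flag per card.
class ToyModUnionTable {
public:
    ToyModUnionTable(std::size_t heap_words, std::size_t card_words)
        : card_words_(card_words),
          dirty_((heap_words + card_words - 1) / card_words, false) {}

    void mark(std::size_t word) { dirty_.at(word / card_words_) = true; }

    bool is_dirty(std::size_t word) const {
        return dirty_.at(word / card_words_);
    }

    // Clear the cards covering [start, end); both bounds must be card-aligned.
    void clear_range(std::size_t start, std::size_t end) {
        assert(start % card_words_ == 0 && end % card_words_ == 0);
        for (std::size_t c = start / card_words_; c < end / card_words_; ++c) {
            dirty_.at(c) = false;
        }
    }

private:
    std::size_t card_words_;
    std::vector<bool> dirty_;
};

// Round x up to the next multiple of align.
std::size_t round_up(std::size_t x, std::size_t align) {
    return (x + align - 1) / align * align;
}

// Returns true if the MUT mark left by a promotion into the gap before the
// current object is still set after the clearing step.
bool promoted_mark_survives(bool skip_gap) {
    const std::size_t card_words = 64;      // assumed card size in words
    const std::size_t old_threshold = 128;  // card-aligned
    const std::size_t promoted_at = 140;    // promotion lands in the gap
    const std::size_t obj_start = 200;      // current object is [200, 260)
    const std::size_t finger = 260;

    ToyModUnionTable mut(512, card_words);
    mut.mark(promoted_at);  // mirrors what promoted() does during marking

    const std::size_t new_threshold = round_up(finger, card_words);
    const std::size_t clear_from = skip_gap
        ? std::max(old_threshold, round_up(obj_start, card_words))
        : old_threshold;
    mut.clear_range(clear_from, new_threshold);
    return mut.is_dirty(promoted_at);
}
```

In this model, promoted_mark_survives(false) clears the card holding the promoted object, while promoted_mark_survives(true) leaves it dirty. Under the rounded-up start, cards are cleared only when the object straddles or starts at a card boundary.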

Testing: jprt with -XX:-UseParNewGC; no test case provided

Thanks for your reviews.
-- ramki
Posted Date : 2008-11-14 20:09:12.0

Upon further review this is not a bug.
Posted Date : 2009-01-13 20:52:10.0