United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
Bug ID: 7013718 G1: small fixes for two assert/guarantee failures
7013718 : G1: small fixes for two assert/guarantee failures

Details
Type:
Bug
Submit Date:
2011-01-20
Status:
Closed
Updated Date:
2011-04-24
Project Name:
JDK
Resolved Date:
2011-04-24
Component:
hotspot
OS:
generic
Sub-Component:
gc
CPU:
generic
Priority:
P3
Resolution:
Fixed
Affected Versions:
hs20
Fixed Versions:
hs21

Related Reports
Backport:
Backport:
Backport:
Relates:
Relates:

Sub Tasks

Description
The changes for the zero filling thread removal (6977804) introduced two minor assert failures:

Internal Error at g1CollectorPolicy.cpp:1771, pid=28818, tid=12
guarantee(hr->is_young() && hr->age_in_surv_rate_group() != -1) failed: invariant

and

Internal Error (/tmp/jprt/P3/B/003718.ap31282/source/src/share/vm/gc_implementation/g1/heapRegionSeq.cpp:201), pid=6070, tid=139783754409744
assert(0 < _alloc_search_start && _alloc_search_start <= _regions.length()) failed: _alloc_search_start: 0 should be valid

                                    

Comments
EVALUATION

http://hg.openjdk.java.net/hsx/hsx20/baseline/rev/bd2e08334e84
                                     
2011-02-09
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/a672e43650cc
                                     
2011-01-22
SUGGESTED FIX

For the assert failure:

Only check the assert if find_contiguous() returns a value that's not -1 (i.e., it found an entry that satisfies the allocation request):

--- a/src/share/vm/gc_implementation/g1/heapRegionSeq.cpp
+++ b/src/share/vm/gc_implementation/g1/heapRegionSeq.cpp
@@ -195,10 +195,10 @@
     assert(0 <= res && res < _regions.length(),
            err_msg("res: %d should be valid", res));
     _alloc_search_start = res + (int) num;
+    assert(0 < _alloc_search_start && _alloc_search_start <= _regions.length(),
+           err_msg("_alloc_search_start: %d should be valid",
+                   _alloc_search_start));
   }
-  assert(0 < _alloc_search_start && _alloc_search_start <= _regions.length(),
-         err_msg("_alloc_search_start: %d should be valid",
-                 _alloc_search_start));
   return res;
 }
                                     
2011-01-20
EVALUATION

For the assert failure:

The assert in question:

    assert(0 < _alloc_search_start && _alloc_search_start <= _regions.length(),
           err_msg("_alloc_search_start: %d should be valid",
                   _alloc_search_start));

is too strong. If _alloc_search_start is 0 when we first call the find_contiguous() method and we fail to find a series of regions that satisfies the allocation request, then it will still be 0 at the end of the method where the assert is (in fact, this is what's happening; the error message prints out its value: "_alloc_search_start: 0"). The assert only holds if find_contiguous() returns a valid reesult and _alloc_search_start += num (so that it cannot be 0).
                                     
2011-01-20
SUGGESTED FIX

For the guarantee failure:

Make sure the marking thread joins the SuspendibleThreadSet before it calls record_concurrent_mark_cleanup_completed():

--- a/src/share/vm/gc_implementation/g1/concurrentMarkThread.cpp
+++ b/src/share/vm/gc_implementation/g1/concurrentMarkThread.cpp
@@ -251,7 +251,9 @@
 
         // Now do the remainder of the cleanup operation.
         _cm->completeCleanup();
+        _sts.join();
         g1_policy->record_concurrent_mark_cleanup_completed();
+        _sts.leave();
 
         double cleanup_end_sec = os::elapsedTime();
         if (PrintGC) {
                                     
2011-01-20
EVALUATION

For the guarantee failure:

First, I should point out that this happens _very_ infrequently. Also note that this failure is not directly caused by the new free list changes but it's uncovered by them (before it would never happen as a pause would always wait for the concurrent cleanup operation to complete; now it can carry on through the pause).

The stack trace is this:

current thread: t@12
  [1] __read(0x0, 0xfffffd7ff35fe310, 0x10, 0x0, 0x1, 0x7fe0), at 0xfffffd7fff293eda 
  [2] read(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff279f21 
=>[3] os::message_box(title = ???, message = ???) (optimized), at 0xfffffd7fe8442a59 (line ~4161) in "os_solaris.cpp"
  [4] VMError::show_message_box(this = ???, buf = ???, buflen = ???) (optimized), at 0xfffffd7fe8a07d1e (line ~50) in "vmError_solaris.cpp"
  [5] VMError::report_and_die(this = ???) (optimized), at 0xfffffd7fe8a0748d (line ~786) in "vmError.cpp"
  [6] report_vm_error(file = ???, line = ???, error_msg = ???, detail_msg = ???) (optimized), at 0xfffffd7fe73226d7 (line ~216) in "debug.cpp"
  [7] G1CollectorPolicy::predict_survivor_regions_evac_time(this = ???) (optimized), at 0xfffffd7fe750fddc (line ~1771) in "g1CollectorPolicy.cpp"
  [8] G1CollectorPolicy::calculate_young_list_target_length(this = ???, rs_lengths = ???) (optimized), at 0xfffffd7fe750dd32 (line ~499) in "g1CollectorPolicy.cpp"
  [9] G1CollectorPolicy::record_concurrent_mark_cleanup_completed(this = ???) (optimized), at 0xfffffd7fe7512664 (line ~470) in "g1CollectorPolicy.cpp"
  [10] ConcurrentMarkThread::run(this = ???) (optimized), at 0xfffffd7fe729f6d1 (line ~254) in "concurrentMarkThread.cpp"
  [11] java_start(thread_addr = ???) (optimized), at 0xfffffd7fe843349e (line ~1067) in "os_solaris.cpp"
  [12] _thrp_setup(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff28acf5 
  [13] _lwp_start(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff28afb0 

So, at the end of concurrent cleanup we call record_concurrent_mark_cleanup_completed() which updates some of the prediction fields. However, given that the concurrent mark thread does not join the Suspendible Thread Set before it does that, that operation could happen during a pause. This is the log fragment that hints that this is what might be happening:

[GC concurrent-cleanup-start]
[GC pause (young) 257M->257M(2032M), 0.0652511 secs]
[GC pause (partial)# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/g1CollectorPolicy.cpp:1771
==============================================================================
Unexpected Error
------------------------------------------------------------------------------
Internal Error at g1CollectorPolicy.cpp:1771, pid=28818, tid=12
guarantee(hr->is_young() && hr->age_in_surv_rate_group() != -1) failed: invariant

Do you want to debug the problem?

To debug, run 'dbx - 28818'; then switch to thread 12
Enter 'yes' to launch dbx automatically (PATH must include dbx)
Otherwise, press RETURN to abort...
==============================================================================
 259M->254M(2032M), 0.1267242 secs]
                                     
2011-01-20



Hardware and Software, Engineered to Work Together