Bug ID: 6847956 G1: crash in oopDesc*G1ParCopyHelper::copy_to_survivor_space(oopDesc*)

Details
Type:
Bug
Submit Date:
2009-06-04
Status:
Resolved
Updated Date:
2011-03-02
Project Name:
JDK
Resolved Date:
2009-11-11
Component:
hotspot
OS:
generic,solaris_10
Sub-Component:
gc
CPU:
sparc,generic
Priority:
P3
Resolution:
Fixed
Affected Versions:
6u14,6u16
Fixed Versions:
hs17


Description
A customer got a crash in G1 immediately after startup:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xfe87933c, pid=8596, tid=7
#
# JRE version: 6.0_14-b08
# Java VM: Java HotSpot(TM) Server VM (14.0-b16 mixed mode solaris-sparc )
# Problematic frame:
# V  [libjvm.so+0x47933c]
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
...
Registers:
 O0=0x00036900 O1=0x0002cd80 O2=0x00000000 O3=0x00000000
 O4=0x00000001 O5=0x00000000 O6=0xfbd7d428 O7=0xfe8a67fc
 G1=0x01ba85ea G2=0x00000000 G3=0xcb5e0000 G4=0xffffe25c
 G5=0x00000000 G6=0x00000000 G7=0xfe261000 Y=0x00000000
 PC=0xfe87933c nPC=0xfe879340
...
Instructions: (pc=0xfe87933c)
0xfe87932c:   97 31 70 3f da 06 c0 13 98 9a e0 01 d8 23 a0 64
0xfe87933c:   de 03 61 14 96 03 e0 01 02 40 00 08 d6 23 a0 60
...
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x47933c] oopDesc*G1ParCopyHelper::copy_to_survivor_space(oopDesc*)+0xb0
V [libjvm.so+0x4814e0] void G1ParCopyClosure::do_oop_work(oopDesc**)+0xa0
V [libjvm.so+0x47db98] void BufferingOopClosure::process_buffer()+0x48
V [libjvm.so+0x47b10c] void G1CollectedHeap::g1_process_strong_roots(bool,SharedHeap::ScanningOption,OopClosure*,OopsInHeapRegionClosure*,OopsInHeapRegionClosure*,OopsInGenClosure*,int)+0x25c
V [libjvm.so+0x47d688] void G1ParTask::work(int)+0x520
V [libjvm.so+0x810388] void GangWorker::loop()+0x8c
V [libjvm.so+0x6f54f0] java_start+0x234 
...
---------------  S Y S T E M  ---------------

OS:                       Solaris 10 11/06 s10s_u3wos_10 SPARC
           Copyright 2006 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                           Assembled 14 November 2006

uname:SunOS 5.10 Generic_120011-14 sun4u  (T2 libthread)
rlimit: STACK 8192k, CORE infinity, NOFILE 65536, AS infinity
load average:0,34 0,31 0,29

CPU:total 14 has_v8, has_v9, has_vis1, has_vis2, is_ultra3

Memory: 8k page, physical 83886080k(56274704k free)

vm_info: Java HotSpot(TM) Server VM (14.0-b16) for solaris-sparc JRE (1.6.0_14-b08), built on May 21 2009 01:43:32 by "" with Workshop 5.8

time: Thu Jun  4 11:10:53 2009
elapsed time: 0 seconds

The full hs_err_pid8596.log is attached to this report.

                                    

Comments
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/1f19207eefc2
                                     
2009-10-06
EVALUATION

For what it is worth: It would be really nice not to have to mark objects as we copy them to the survivors (to avoid the extra overhead during the GC pause, as well as avoid having to notify the marking phase that those objects have moved). Note that, if we do several GC pauses during a marking phase, the majority of objects in the survivors would be objects that were allocated since the start of the marking phase which, according to the SATB invariants, we do not have to visit during the marking phase; it's only the objects in the survivors after the initial-mark pause we really need to visit. I'll open a CR to track this idea (it's CR 6888336).
                                     
2009-10-05
WORK AROUND

-XX:MaxTenuringThreshold=0
                                     
2009-10-03
EVALUATION

The incomplete marking issue arises because, when marking is in progress, we deal with the survivor spaces incorrectly.

In G1, there are two ways in which an object is considered live: first, if it's marked in the bitmap; second, if it's over the "TAMS" (top at mark start) variable of its containing region. We keep two copies of this liveness information: the "previous" one (the last one that was obtained, which is known to be consistent) and the "next" one (the one currently in progress, which might be inconsistent). Here we deal with the next marking info, as it's the one that's being obtained during the marking cycle.
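The two liveness rules above can be sketched as a single predicate. This is only a toy model, not the HotSpot code: ToyRegion, its fields and is_live_next are made-up names, addresses are plain integers, and "over TAMS" is assumed to mean "at or above NTAMS".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of one heap region; addresses are plain indices.
// All names are illustrative assumptions, not HotSpot declarations.
struct ToyRegion {
    std::size_t bottom;            // first address of the region
    std::size_t top;               // first unallocated address
    std::size_t ntams;             // "next" top-at-mark-start
    std::vector<bool> next_bitmap; // one "next" mark bit per address, indexed from bottom

    // Live w.r.t. the "next" marking info if explicitly marked in the
    // bitmap, or implicitly live because the address is at or above NTAMS.
    bool is_live_next(std::size_t addr) const {
        assert(addr >= bottom && addr < top);
        bool implicitly_live = addr >= ntams;
        bool explicitly_marked = next_bitmap[addr - bottom];
        return implicitly_live || explicitly_marked;
    }
};
```

With bottom = 10, top = 14 and ntams = 12, addresses 12 and 13 are live without any mark bit, while 10 and 11 are live only if their bits are set.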

One more thing to point out is that, in G1, when we evacuate objects during evacuation pauses, if they are considered live we also have to explicitly mark them in their new location too (typically, by marking them in the bitmap). In some cases we also have to notify the marking threads that an object has been evacuated.

The bug arises because, during marking, we explicitly set the NTAMS (next TAMS) variable of each region that contains survivors to bottom, thus making all its contents implicitly live. Consider the following scenario, where we have:

a -> b -> c

with a and b being in a survivor space, and c being, say, in the old generation. Let's also assume that, when we start the evacuation pause, a is marked, b and c are not.

When we copy a and b to a survivor region, we'll propagate a's mark to the bitmap, notify the marking threads that they have to visit it, and then set the NTAMS field of that region to bottom, making them both implicitly marked (note that a is both explicitly and implicitly marked at this point).

When marking finally comes across a, it says "ah, b is already live" (because b is over NTAMS) and incorrectly doesn't process b further. As a result, b is never visited by the marking threads and c is never marked.
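The a -> b -> c scenario can be replayed in a small simulation. This is a sketch under stated assumptions rather than the G1 marking code: SimObj, SimRegion and c_gets_marked are hypothetical names, each object has at most one reference, and marking is reduced to a single worklist loop that skips any referent it already considers live.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Hypothetical toy types, not HotSpot classes.
struct SimRegion { int bottom; int top; int ntams; };
struct SimObj {
    int region;   // index into the regions vector
    int addr;     // address in the toy heap
    bool marked;  // bit in the "next" marking bitmap
    int ref;      // index of the referenced object, or -1 for none
};

static bool is_live(const SimObj& o, const std::vector<SimRegion>& regions) {
    return o.marked || o.addr >= regions[o.region].ntams;
}

// Returns true if c ends up marked. Passing true selects the buggy
// behaviour, where the survivor region's NTAMS is set to bottom.
bool c_gets_marked(bool survivor_ntams_at_bottom) {
    // Region 0: survivor region holding the copies of a and b.
    // Region 1: old region holding c, allocated before marking started.
    std::vector<SimRegion> regions = { {0, 2, 0}, {100, 101, 101} };
    regions[0].ntams = survivor_ntams_at_bottom ? regions[0].bottom
                                                : regions[0].top;

    // a -> b -> c; a's mark was propagated on copy, b and c are unmarked.
    std::vector<SimObj> objs = {
        {0, 0, true, 1},     // a
        {0, 1, false, 2},    // b
        {1, 100, false, -1}  // c
    };

    std::deque<int> worklist;
    worklist.push_back(0);  // the marking threads were told to visit a
    while (!worklist.empty()) {
        int i = worklist.front();
        worklist.pop_front();
        int r = objs[i].ref;
        if (r < 0) continue;
        assert(r < static_cast<int>(objs.size()));
        if (is_live(objs[r], regions)) continue;  // "already live": skipped
        objs[r].marked = true;
        worklist.push_back(r);
    }
    return objs[2].marked;
}
```

In this model, c_gets_marked(true) returns false because b is treated as implicitly live and never scanned, while c_gets_marked(false) returns true because b sits below NTAMS, gets marked and scanned, and its reference to c is followed.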
                                     
2009-10-03
EVALUATION

I should have added: Typically, I could get the test to fail within 30 mins and after 3 marking cycles at most (typically, it'd fail after the first). I ran with the fix overnight for 12+ hours and 360+ marking cycles with no failures.
                                     
2009-10-03
SUGGESTED FIX

The fix is straightforward:

heapRegion.hpp:

   void note_end_of_copying() {
-    assert(top() >= _next_top_at_mark_start,
-           "Increase only");
-    // Survivor regions will be scanned on the start of concurrent
-    // marking.
-    if (!is_survivor()) {
-      _next_top_at_mark_start = top();
-    }
+    assert(top() >= _next_top_at_mark_start, "Increase only");
+    _next_top_at_mark_start = top();
   }
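
For illustration only, the effect of the patch on a survivor region can be shown with a stand-in type. ToyHeapRegion and its field names are assumptions made for this sketch; only the body of note_end_of_copying follows the diff above.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for the region state touched by the patch.
// Field and method names are illustrative, not the HotSpot declarations.
struct ToyHeapRegion {
    std::size_t top_addr = 0;
    std::size_t next_top_at_mark_start = 0;
    bool survivor = false;

    // Pre-fix behaviour: survivor regions keep their old NTAMS.
    void note_end_of_copying_before_fix() {
        assert(top_addr >= next_top_at_mark_start && "Increase only");
        if (!survivor) {
            next_top_at_mark_start = top_addr;
        }
    }

    // Post-fix behaviour: NTAMS is raised to top for every region.
    void note_end_of_copying_after_fix() {
        assert(top_addr >= next_top_at_mark_start && "Increase only");
        next_top_at_mark_start = top_addr;
    }
};
```

For a survivor region with NTAMS at 0 and top at 8, the pre-fix version leaves NTAMS at 0 (everything copied in stays implicitly live), whereas the post-fix version raises it to 8, so copied objects count as live only if they are explicitly marked.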
                                     
2009-10-03


