Bug ID: 7014261 G1: RSet-related failures

Details
Type: Bug
Submit Date: 2011-01-24
Status: Closed
Updated Date: 2011-03-07
Project Name: JDK
Resolved Date: 2011-03-07
Component: hotspot
OS: generic
Sub-Component: gc
CPU: generic
Priority: P2
Resolution: Fixed
Affected Versions: hs20
Fixed Versions: hs21

Description
Since the push of 6977804 ("G1: remove the zero-filling thread"), we have been seeing intermittent JPRT failures with GCBasher / G1, mainly with product builds. The failure usually complains about a double free, like this one:

*** glibc detected *** /tmp/jprt/P2/T/111410.et151817/testproduct/linux_x64_2.4-product/bin/java: double free or corruption (!prev): 0x00007f59c8524ce0 ***

I also saw (once) a failure in fastdebug complaining about an apparent inconsistency in the RSets:

#  Internal Error (/tmp/jprt/P3/B/164311.ap31282/source/src/share/vm/gc_implementation/g1/sparsePRT.cpp:172),
  pid=20790, tid=1083717968
#  guarantee(_entries != NULL) failed: INV

                                    

Comments
EVALUATION

http://hg.openjdk.java.net/hsx/hsx20/baseline/rev/4e66274b6bb3
                                     
2011-02-09
EVALUATION

http://hg.openjdk.java.net/jdk7/hotspot-gc/hotspot/rev/97ba643ea3ed
                                     
2011-01-26
SUGGESTED FIX

The fix is to purge from the expanded list the entries that correspond to regions that are being cleaned up (those will be dealt with by the concurrent cleanup process). The way I chose to implement it is to null the expanded list at the beginning of cleanup and recreate it during cleanup, ignoring regions that were freed. Each cleanup thread creates a local expanded "sublist" (so that no locks / atomics are needed while creating those) and all the sublists are merged right at the end.
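
The rebuild described above can be sketched roughly as follows. This is a simplified, single-threaded model and not the HotSpot code: `Region`, `build_sublist` and `rebuild_expanded_list` are hypothetical names, the workers are simulated by plain function calls over disjoint chunks, and region ids stand in for the sparse tables themselves.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a heap region, reduced to what the sketch needs.
struct Region {
    int  id;
    bool has_expanded_table;  // its sparse table was expanded since the last pause
    bool freed;               // the region is being reclaimed by this cleanup
};

// One worker's share of the cleanup: collect surviving expanded regions into a
// private sublist. Only local state is written, so no locks or atomics are needed.
std::vector<int> build_sublist(const std::vector<Region>& regions,
                               std::size_t begin, std::size_t end) {
    std::vector<int> local;
    for (std::size_t i = begin; i < end; ++i) {
        if (regions[i].has_expanded_table && !regions[i].freed) {
            local.push_back(regions[i].id);
        }
    }
    return local;
}

// The old expanded list is discarded (we start from an empty one) and rebuilt
// from the per-worker sublists, which are merged in a single step at the end.
std::vector<int> rebuild_expanded_list(const std::vector<Region>& regions,
                                       std::size_t workers) {
    if (workers == 0) workers = 1;
    const std::size_t chunk = (regions.size() + workers - 1) / workers;

    std::vector<std::vector<int>> sublists;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        if (begin >= regions.size()) break;
        const std::size_t end = std::min(begin + chunk, regions.size());
        // In a real collector each of these calls would run on its own thread.
        sublists.push_back(build_sublist(regions, begin, end));
    }

    std::vector<int> expanded;  // the recreated expanded list
    for (const std::vector<int>& s : sublists) {
        expanded.insert(expanded.end(), s.begin(), s.end());
    }
    return expanded;
}
```

Because freed regions never make it onto the rebuilt list, the pause-time processing of the expanded list can no longer reach a sparse table that the concurrent cleanup is clearing.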
                                     
2011-01-25
EVALUATION

We know what the race is:

A heap region's RSet comprises several tables, including a "sparse" table. Sparse tables have two RSHashTables: cur and next. Those two usually point to the same physical table. When we want to expand a sparse table we create a new next RSHashTable, which is larger than the old cur, and we copy the contents from cur into next. For a while the sparse table has two RSHashTables: next, where new entries are added, and cur, which is used for iterations. (Note: when we add new entries to an RSet during a pause we generally have to make sure we scan those specially; so we only need to iterate over cur while scanning the RSet and we can safely ignore next.)

Expanded sparse tables are added to a list (the "expanded list") so that we process them before we iterate over the RSets at the beginning of a pause. "Processing" them involves freeing the old cur and replacing it with next.
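
To make the cur/next scheme and the expanded list concrete, here is a minimal single-threaded model. It is only a sketch under stated assumptions: `HashTable`, `SparseTable`, `finish_expansion` and `process_expanded_list` are hypothetical names chosen for illustration, the capacity-doubling policy is an assumption, and the real RSHashTable is a hash table rather than a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for RSHashTable: a capacity plus the stored entries.
struct HashTable {
    std::size_t capacity;
    std::vector<int> entries;
    explicit HashTable(std::size_t cap) : capacity(cap) {}
};

// Simplified model of a sparse table with its cur and next pointers.
class SparseTable {
public:
    explicit SparseTable(std::size_t cap)
        : cur_(new HashTable(cap)), next_(cur_), expanded_(false) {}
    ~SparseTable() {
        if (next_ != cur_) delete next_;
        delete cur_;
    }
    SparseTable(const SparseTable&) = delete;
    SparseTable& operator=(const SparseTable&) = delete;

    // New entries always go into next; when next is full, expand first.
    void add(int entry, std::vector<SparseTable*>& expanded_list) {
        if (next_->entries.size() >= next_->capacity) {
            expand(expanded_list);
        }
        next_->entries.push_back(entry);
    }

    // "Processing" an expanded table: free the old cur and replace it with next.
    void finish_expansion() {
        if (cur_ != next_) {
            delete cur_;
            cur_ = next_;
        }
        expanded_ = false;
    }

    std::size_t cur_size() const  { return cur_->entries.size(); }   // iteration view
    std::size_t next_size() const { return next_->entries.size(); }  // insertion view

private:
    void expand(std::vector<SparseTable*>& expanded_list) {
        HashTable* last = next_;
        next_ = new HashTable(last->capacity * 2);  // assumed growth policy
        next_->entries = last->entries;             // copy the contents across
        if (last != cur_) delete last;              // cur must survive for iteration
        if (!expanded_) {                           // register once per expansion cycle
            expanded_ = true;
            expanded_list.push_back(this);
        }
    }

    HashTable* cur_;
    HashTable* next_;
    bool expanded_;
};

// Done at the beginning of a pause, before iterating over the RSets.
void process_expanded_list(std::vector<SparseTable*>& expanded_list) {
    for (SparseTable* t : expanded_list) {
        t->finish_expansion();
    }
    expanded_list.clear();
}
```

In this model the only place an old cur is freed outside the destructor is `finish_expansion`, which is why a second, unsynchronized code path that also frees or clears the same table is dangerous.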

The race is as follows:

During cleanup we reclaim several regions that have expanded sparse tables, and those tables are on the expanded list. The reclaimed regions are added to the cleanup list.
Thread 1: the concurrent cleanup thread starts processing the cleanup list and clears the RSet of every region on it, including its sparse table.
Thread 2: the VM thread processes the expanded list; it frees the old cur RSHashTable of each sparse table and replaces it with next.
Given that the concurrent cleanup operation can now run through a pause, Threads 1 and 2 can race and reach the same sparse table. This can result in the two failures we're seeing:
- one thread deletes the cur entry first, and the other then tries to delete it and finds that it has already been deleted (that's the guarantee failure; the destructor is the only place where _entries is set to NULL)
- both threads try to delete the same entry, which explains the double free.

The race happens due to the increased concurrency that was introduced by 6977804. Before, the concurrent cleanup operation and a pause were mutually exclusive, which is why we never hit the issue.
                                     
2011-01-24


