The problem looks to be an SATB queue being manipulated unsafely across a safepoint.
Here's a concrete example (and I can reproduce similar ones reasonably easily). A concurrent marking cycle is in progress and is aborted by a Full GC. Here's the status of one particular thread at the start of the Full GC:
thread 0x02a61800 | 0x072fe6c8 2808 4096
This means: thread 0x02a61800 has an SATB buffer at address 0x072fe6c8 of size 4096 and its current index is 2808 (remember: indexes start from the top, i.e., 4096, and decrease towards 0 in word-size steps, i.e., 4 in the case of the 32-bit JVM; so the indexes are really byte offsets into the buffer).
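To make the index convention concrete, here is a minimal sketch (illustrative only, not HotSpot source) of the arithmetic described above, with the 4096-byte buffer and 4-byte word size from the 32-bit case:

```cpp
#include <cstddef>

// Sketch of the SATB buffer index convention: the index starts at the
// buffer size in bytes and walks down toward 0 in word-size steps, so it
// is really a byte offset into the buffer. Names here are illustrative.
const std::size_t kBufSize  = 4096;  // buffer size in bytes
const std::size_t kWordSize = 4;     // word size on the 32-bit JVM

// Number of entries currently enqueued for a given index value.
std::size_t satb_entries(std::size_t index) {
    return (kBufSize - index) / kWordSize;
}

// Index value after enqueueing one more entry.
std::size_t satb_push(std::size_t index) {
    return index - kWordSize;  // indexes decrease towards 0
}
```

So an index of 4096 means the queue is empty, and the index 2808 above means (4096 - 2808) / 4 = 322 entries have been enqueued.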
During the Full GC the SATB queues on all threads are reset (as the concurrent marking has just been aborted), and this is the status of the above thread at the end of the Full GC:
thread 0x02a61800 | 0x072fe6c8 4096 4096
So, clearly, its associated queue has been reset and its index set to end (4096). Now, here's the status of the same thread at the beginning of the following young GC:
thread 0x02a61800 | 0x072fe6c8 2804 4096
So, clearly, someone just overwrote the index with a value that is its pre-Full GC value minus one slot. And I can reproduce the same scenario. The most common overwriting value is 4092 (i.e., 4096 minus one slot), but some smaller values appear too; and it's always one slot less than the pre-Full GC value. To me this is a smoking gun that a thread is adding an entry to that SATB queue: it caches the old index, is then (incorrectly) stopped at a safepoint, and completes the operation after the safepoint using the old index.
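The suspected interleaving can be re-enacted sequentially. This is an illustrative single-threaded sketch (not HotSpot code) of the three steps: the barrier caches the old index, the Full GC safepoint resets the queue, and the barrier then completes the enqueue with the stale value:

```cpp
#include <cstddef>

// Single-threaded re-enactment of the suspected race. The field and
// function names are illustrative, not from the HotSpot sources.
struct QueueState { std::size_t index; };

const std::size_t kResetIndex = 4096;  // index value after a queue reset
const std::size_t kSlot       = 4;     // one word-sized slot

std::size_t simulate_stale_barrier(std::size_t pre_gc_index) {
    QueueState q{pre_gc_index};
    std::size_t cached = q.index;  // barrier reads _index before the safepoint
    q.index = kResetIndex;         // Full GC safepoint: queue reset to 4096
    q.index = cached - kSlot;      // barrier resumes with the stale index
    return q.index;                // pre-Full GC value minus one slot
}
```

With the values from the trace above, simulate_stale_barrier(2808) yields 2804, and a thread whose pre-Full GC index was 4096 yields the commonly observed 4092.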
An obvious culprit for this would be JITed code. But I'll see if I can find any places in the runtime where this can happen too.
Posted Date : 2009-01-20 21:41:26.0
I found an example that's slightly different to what I reported above (I'm using the same format as in the previous note). The status of a particular thread before the Full GC:
thread 0x01f40c00 | 0x02c53c78 2324 4096
after the Full GC:
thread 0x01f40c00 | 0x02c53c78 4096 4096
and at the beginning of the following young GC:
thread 0x01f40c00 | 0x02c53c78 4092 4096
So, the index is updated incorrectly, but it's not one slot below what it was at the beginning of the Full GC. I think this can still be caused by the queue being manipulated unsafely across a safepoint: a thread checks whether the SATB barriers are active and is then interrupted for the Full GC safepoint. After the safepoint, it adds an entry to the queue, even though the queue is supposed to be inactive.
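This second variant can be sketched the same way (again illustrative, not HotSpot code): here only the _active check is stale, and the enqueue itself uses the freshly reset index, so the result is 4092 regardless of the pre-Full GC value:

```cpp
#include <cstddef>

// Variant where only the _active flag is read before the safepoint.
// Names are illustrative, not from the HotSpot sources.
struct Queue2 { bool active; std::size_t index; };

const std::size_t kSize = 4096;  // index value after a queue reset
const std::size_t kWord = 4;     // one word-sized slot

std::size_t simulate_stale_active(std::size_t pre_gc_index) {
    Queue2 q{true, pre_gc_index};
    bool cached_active = q.active;  // _active read before the safepoint
    q.active = false;               // safepoint: marking aborted...
    q.index  = kSize;               // ...and the queue reset
    if (cached_active)              // stale decision: queue looks active
        q.index -= kWord;           // enqueue against the reset index
    return q.index;                 // 4092, whatever the pre-GC index was
}
```

This matches the trace above: the pre-Full GC index was 2324, yet the post-safepoint value is 4092, one slot below the reset value rather than below 2324.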
Posted Date : 2009-01-21 16:08:21.0
As expected, the bug is indeed due to a safepoint happening half-way through an SATB write barrier. Below is an e-mail I sent to Tom Rodriguez that explains the issue.
I added some extra verbosity and I can now get the PC of the suspect thread at the Full GC safepoint. I used Workshop to print the assembly, which I have included below. The PC at the safepoint is at address 0xfa95a654. I've added some comments to the assembly, marked with >>> ... <<<, and here's some commentary:
In the thread structure, the SATB queue is at offset 680, so the _active flag is at offset 688 and the _index field (which points to where the last entry written in the buffer is) is at offset 696.
_active and _index are read at 0xfa95a5e0 and 0xfa95a600 respectively.
The safepoint happens at 0xfa95a654.
Then, _active is checked at 0xfa95a688, after the safepoint. Also, _index is checked against 0 (i.e., whether the buffer is full) at 0xfa95a6a4 and manipulated a few instructions below.
So, I think this proves that a safepoint does indeed happen half-way through the barrier and, when we actually use the values of _active and _index, they are in fact stale (as they might have been reset during the safepoint).
Would you mind looking at this further? I have the failed JVM in a debugger if you want to take a peek. Anything else I can do to help out at this point?
Actually, I just thought of one more thing just before I pressed "Send". In the failing example I described, _index is found to be 944 some time after the Full GC (i.e., what it was before the Full GC, 948, minus 4). And I've seen this scenario happen quite often. However, with the second patch you sent me (the one that constrained the reading of _buf and _index), even though the bug still happened, I only saw indexes of 4092. This tells me that _index was constrained correctly (as it always picked up the value it was reset to during the Full GC), but we missed constraining _active. Maybe doing that would solve the issue? I hope this makes sense!
0xfa95a5d4: bne,pn %icc,0xfa95ba48 ! 0xfa95ba48
0xfa95a5dc: ldsh [%l2 + 12], %g4
0xfa95a5e0: ldsb [%g2 + 688], %g1 ! >>> read _active into %g1 <<<
0xfa95a5e4: ldsh [%i4 + 8], %o0
0xfa95a5e8: ld [%g2 + 692], %g3 ! >>> read _buf into %g3 <<<
0xfa95a5ec: st %i3, [%sp + 104]
0xfa95a5f0: ld [%l4], %g0
0xfa95a5f4: mov %i0, %i3
0xfa95a5f8: cmp %o0, %g4
0xfa95a5fc: be,pn %icc,0xfa95a674 ! 0xfa95a674
0xfa95a600: ld [%g2 + 696], %l7 ! >>> read _index into %l7 <<<
0xfa95a604: cmp %i1, 0
0xfa95a608: be,pn %icc,0xfa95a75c ! 0xfa95a75c
0xfa95a610: ld [%sp + 100], %i4
0xfa95a614: st %i1, [%sp + 100]
0xfa95a618: ld [%sp + 104], %i3
0xfa95a61c: ba,pt %icc,0xfa95a400 ! 0xfa95a400
0xfa95a624: ld [%sp + 100], %g4
0xfa95a628: ld [%g4 + 28], %i1
0xfa95a62c: cmp %i1, 0
0xfa95a630: be,pn %icc,0xfa95a754 ! 0xfa95a754
0xfa95a640: ld [%i1 + 24], %o0
0xfa95a644: cmp %g4, %o0
0xfa95a648: bne,pn %icc,0xfa95a464 ! 0xfa95a464
0xfa95a650: ld [%i1 + 28], %o0
0xfa95a654: ld [%l4], %g0 ! >>> safepoint! <<<
0xfa95a658: cmp %o0, 0
0xfa95a65c: be,pn %icc,0xfa95a754 ! 0xfa95a754
0xfa95a664: mov %i1, %g4
0xfa95a668: mov %o0, %i1
0xfa95a66c: ba,pt %icc,0xfa95a640 ! 0xfa95a640
0xfa95a674: add %i5, 1, %i0
0xfa95a678: ld [%sp + 96], %l5
0xfa95a67c: cmp %i0, %l5
0xfa95a680: bcc,pn %icc,0xfa95b54c ! 0xfa95b54c
0xfa95a684: add %l3, 16, %i1
0xfa95a688: cmp %g1, 0 ! >>> if (_active) <<<
0xfa95a68c: be,pn %icc,0xfa95a6c0 ! 0xfa95a6c0
0xfa95a694: ld [%i1], %o0
0xfa95a698: cmp %o0, 0
0xfa95a69c: be,pn %icc,0xfa95a6c0 ! 0xfa95a6c0
0xfa95a6a4: cmp %l7, 0 ! >>> if (_index == 0) <<<
0xfa95a6a8: be,pn %icc,0xfa95b2bc ! 0xfa95b2bc
0xfa95a6b0: add %g3, %l7, %g1
0xfa95a6b4: add %l7, -4, %l3
0xfa95a6b8: st %o0, [%g1 - 4]
0xfa95a6bc: st %l3, [%g2 + 696] ! >>> _index = _index - 4 <<<
0xfa95a6c0: mov %i1, %l3
0xfa95a6c4: mov %i4, %g1
0xfa95a6c8: xor %g1, %l3, %l7
0xfa95a6cc: srl %l7, 20, %g1
0xfa95a6d0: cmp %g1, 0
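For readability, here is roughly what the annotated part of the listing above computes, written as a C++ sketch. The field layout follows the commentary (SATB queue at thread+680, _active at +688, buffer pointer at +692, _index at +696); the struct and function names are illustrative, not HotSpot source:

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative model of the thread-local SATB queue fields used by the
// JITed barrier (names are mine; offsets are from the commentary above).
struct ThreadSketch {
    bool     active;  // _active: is the SATB barrier enabled?
    uint8_t* buf;     // _buf: the enqueue buffer
    size_t   index;   // _index: byte offset of the last entry written
};

// Returns false when the buffer is full and the slow path would be taken.
bool satb_enqueue(ThreadSketch* t, uint32_t pre_val) {
    if (!t->active || pre_val == 0)  // cmp %g1, 0 / cmp %o0, 0
        return true;                 // barrier inactive or null pre-value
    if (t->index == 0)               // cmp %l7, 0: buffer full?
        return false;                // slow path (0xfa95b2bc)
    // st %o0, [%g1 - 4]: store the entry one word below the current index
    *reinterpret_cast<uint32_t*>(t->buf + t->index - 4) = pre_val;
    t->index -= 4;                   // st %l3, [%g2 + 696]: _index -= 4
    return true;
}
```

The bug is that the loads feeding the `active` and `index` reads happen before the safepoint at 0xfa95a654, while the checks and the store happen after it.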
Posted Date : 2009-02-03 22:26:17.0
Tom recommended trying a run with -XX:-SplitIfBlocks. This makes the crash happen less often, but it does still happen.
Posted Date : 2009-02-04 16:46:26.0
The root of the problem is that SafePointNodes don't produce a new raw memory state, so using a memory dependence to ensure proper code emission for the G1 write barrier doesn't work correctly all the time: the load of raw memory can be moved above a prior safepoint. For now the fix is to assign control to all the loads involved in the write barrier. Additionally, there's some suspicious code in LoadNode::Ideal that strips useless control; I've added a check for raw memory there too. Currently it won't trigger for these loads, because the ThreadLocal that forms the base is TypeRawPtr::BOTTOM, but I think the proper type for ThreadLocal is TypeRawPtr::NOTNULL. There are other nodes which use TypeRawPtr::NOTNULL that might be at risk.
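The scheduling hazard can be pictured as a pseudocode ideal-graph sketch (not actual C2 source). Without a control input, global code motion is free to hoist the raw-memory loads above the SafePointNode, because the safepoint does not produce a new raw memory state that the loads would depend on:

```
as the barrier is written:        a legal (buggy) schedule:

  ...                               LoadB  _active   ! hoisted
  SafePoint                         LoadI  _index    ! hoisted
  LoadB  _active                    SafePoint        ! queue reset here
  LoadI  _index                     ...checks and store use stale values
  ...checks and store...
```

Pinning each of these loads with a control edge ordered after the safepoint removes the compiler's freedom to float them above it, which is the fix described above.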
Posted Date : 2009-02-04 21:02:32.0
Posted Date : 2009-02-05 22:10:31.0
Posted Date : 2009-02-12 12:16:40.0