As expected, the bug is indeed due to a safepoint happening half-way through an SATB write barrier. Below is an e-mail I sent to Tom Rodriguez that explains the issue.
I added some extra verbosity and I can now get the PC of the suspect thread at the Full GC safepoint. I used Workshop to print the assembly, which I have included below. The PC at the safepoint is at address 0xfa95a654. I've added some comments to the assembly, marked with >>> ... <<<, and here's some commentary:
In the thread structure, the SATB queue is at offset 680, so the _active flag is at offset 688 and the _index field (which points to the last entry written in the buffer) is at offset 696.
_active and _index are read at 0xfa95a5e0 and 0xfa95a600 respectively.
The safepoint happens at 0xfa95a654.
Then, _active is checked at 0xfa95a688, after the safepoint. Also, _index is checked for 0 (i.e., whether the buffer is full) at 0xfa95a6a4 and manipulated a few instructions below.
So, I think this proves that a safepoint does indeed happen half-way through the barrier and, when we actually use the values of _active and _index, they are in fact stale (as they might have been reset during the safepoint).
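For reference, the barrier that the compiled code is inlining boils down to something like the sketch below. The field names mirror the SATB queue fields discussed above, but this is a simplified illustration, not the actual HotSpot code; the comment marks the window where the safepoint lands:

```cpp
#include <cstddef>

// Simplified sketch of a per-thread SATB queue (not the real layout).
struct SATBQueue {
  bool   _active;  // true only while concurrent marking is running
  void*  _buf;     // current buffer
  size_t _index;   // byte offset of the last entry written; counts down to 0
};

// Pre-barrier for a reference store: record the old value while marking is
// active. The compiled barrier reads _active and _index up front; a safepoint
// in the marked window can reset both, and the tail of the barrier then runs
// with stale values.
void satb_write_barrier(SATBQueue* q, void* pre_val) {
  bool   active = q->_active;          // >>> read _active <<<
  char*  buf    = (char*) q->_buf;
  size_t index  = q->_index;           // >>> read _index <<<
  // <<< a safepoint here can abort marking and reset the queue >>>
  if (active && pre_val != nullptr) {
    if (index == 0) return;            // buffer full; real code flushes it
    index -= sizeof(void*);
    *(void**)(buf + index) = pre_val;  // store the entry
    q->_index = index;                 // writes back a possibly stale index
  }
}
```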
Would you mind looking at this further? I have the failed JVM in a debugger if you want to take a peek. Anything else I can do to help out at this point?
Actually, I just thought of one more thing just before I pressed "Send". In the failing example I described, _index is found to be 944 some time after the Full GC (i.e., what it was before the Full GC, 948, minus 4). And I've seen this scenario happen quite often. However, with the second patch you sent me (the one that constrained the reading of _buf and _index), even though the bug still happened, I only saw indexes of 4092. This tells me that _index was correctly constrained (as it always picked up the value it was reset to during the Full GC), but we missed constraining _active. Maybe doing that would solve the issue? I hope this makes sense!
0xfa95a5d4: bne,pn %icc,0xfa95ba48 ! 0xfa95ba48
0xfa95a5dc: ldsh [%l2 + 12], %g4
0xfa95a5e0: ldsb [%g2 + 688], %g1 ! >>> read _active into %g1 <<<
0xfa95a5e4: ldsh [%i4 + 8], %o0
0xfa95a5e8: ld [%g2 + 692], %g3
0xfa95a5ec: st %i3, [%sp + 104]
0xfa95a5f0: ld [%l4], %g0
0xfa95a5f4: mov %i0, %i3
0xfa95a5f8: cmp %o0, %g4
0xfa95a5fc: be,pn %icc,0xfa95a674 ! 0xfa95a674
0xfa95a600: ld [%g2 + 696], %l7 ! >>> read _index into %l7 <<<
0xfa95a604: cmp %i1, 0
0xfa95a608: be,pn %icc,0xfa95a75c ! 0xfa95a75c
0xfa95a610: ld [%sp + 100], %i4
0xfa95a614: st %i1, [%sp + 100]
0xfa95a618: ld [%sp + 104], %i3
0xfa95a61c: ba,pt %icc,0xfa95a400 ! 0xfa95a400
0xfa95a624: ld [%sp + 100], %g4
0xfa95a628: ld [%g4 + 28], %i1
0xfa95a62c: cmp %i1, 0
0xfa95a630: be,pn %icc,0xfa95a754 ! 0xfa95a754
0xfa95a640: ld [%i1 + 24], %o0
0xfa95a644: cmp %g4, %o0
0xfa95a648: bne,pn %icc,0xfa95a464 ! 0xfa95a464
0xfa95a650: ld [%i1 + 28], %o0
0xfa95a654: ld [%l4], %g0 ! >>> safepoint! <<<
0xfa95a658: cmp %o0, 0
0xfa95a65c: be,pn %icc,0xfa95a754 ! 0xfa95a754
0xfa95a664: mov %i1, %g4
0xfa95a668: mov %o0, %i1
0xfa95a66c: ba,pt %icc,0xfa95a640 ! 0xfa95a640
0xfa95a674: add %i5, 1, %i0
0xfa95a678: ld [%sp + 96], %l5
0xfa95a67c: cmp %i0, %l5
0xfa95a680: bcc,pn %icc,0xfa95b54c ! 0xfa95b54c
0xfa95a684: add %l3, 16, %i1
0xfa95a688: cmp %g1, 0 ! >>> if (_active) <<<
0xfa95a68c: be,pn %icc,0xfa95a6c0 ! 0xfa95a6c0
0xfa95a694: ld [%i1], %o0
0xfa95a698: cmp %o0, 0
0xfa95a69c: be,pn %icc,0xfa95a6c0 ! 0xfa95a6c0
0xfa95a6a4: cmp %l7, 0 ! >>> if (_index == 0) <<<
0xfa95a6a8: be,pn %icc,0xfa95b2bc ! 0xfa95b2bc
0xfa95a6b0: add %g3, %l7, %g1
0xfa95a6b4: add %l7, -4, %l3
0xfa95a6b8: st %o0, [%g1 - 4]
0xfa95a6bc: st %l3, [%g2 + 696] ! >>> _index = _index - 4 <<<
0xfa95a6c0: mov %i1, %l3
0xfa95a6c4: mov %i4, %g1
0xfa95a6c8: xor %g1, %l3, %l7
0xfa95a6cc: srl %l7, 20, %g1
0xfa95a6d0: cmp %g1, 0
The problem looks to be an SATB queue being manipulated unsafely across a safepoint.
Here's a concrete example (and I can reproduce similar ones reasonably easily). There's a concurrent marking going on and it is aborted due to a Full GC. Here's the status of one particular thread at the start of the Full GC:
thread 0x02a61800 | 0x072fe6c8 2808 4096
This means: thread 0x02a61800 has an SATB buffer with address 0x072fe6c8 of size 4096 and its current index is 2808 (remember: indexes start from the top, i.e., 4096, and decrease towards 0 in word size steps, i.e. 4 in the case of the 32-bit JVM; so the indexes are really offsets in this case).
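As a sanity check on these numbers, converting an index to an entry count (under the 32-bit, 4-byte-slot layout just described) is straightforward:

```cpp
#include <cstddef>

// Entries currently queued, given that the index starts at the buffer size
// and decreases by one word (4 bytes on the 32-bit JVM) per entry.
size_t entries_in_buffer(size_t buffer_size_bytes, size_t index) {
  const size_t slot = 4;  // word size on the 32-bit JVM discussed here
  return (buffer_size_bytes - index) / slot;
}
```

E.g., index 2808 in a 4096-byte buffer means (4096 - 2808) / 4 = 322 queued entries, and index 4096 means the buffer is empty.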
During the Full GC the SATB queues on all threads are reset (as the concurrent marking has just been aborted), and this is the status of the above thread at the end of the Full GC:
thread 0x02a61800 | 0x072fe6c8 4096 4096
So, clearly, its associated queue has been reset and its index set to end (4096). Now, here's the status of the same thread at the beginning of the following young GC:
thread 0x02a61800 | 0x072fe6c8 2804 4096
So, clearly, someone just overwrote the index with a value that is its pre-Full GC value minus one slot. And I can reproduce the same scenario. The most common overwriting value is 4092 (i.e., 4096 minus one slot), but some smaller values appear too; and it's always one slot less than the pre-Full GC value. To me this is a smoking gun: a thread adding an entry to that SATB queue caches the old index, is (incorrectly) stopped at a safepoint, then completes the operation after the safepoint using the stale index.
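The interleaving I'm suspecting can be sketched in a few lines (pure simulation, not JVM code; 4 is the 32-bit word size):

```cpp
#include <cstddef>

// Simulate the suspected interleaving: the mutator caches _index, the
// Full GC resets the queue at a safepoint, and the mutator then finishes
// the enqueue with the stale value. Returns the resulting index.
size_t simulate_stale_enqueue(size_t pre_gc_index, size_t buffer_size) {
  size_t cached = pre_gc_index;  // barrier reads _index before the safepoint
  size_t index  = buffer_size;   // safepoint: Full GC resets the queue
  index = cached - 4;            // barrier completes with the stale value
  return index;                  // pre-Full-GC value minus one slot
}
```

With a pre-Full GC index of 2808 this yields 2804 (the value observed above), and with 4096 (an empty queue) it yields the common 4092.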
An obvious culprit for this would be JITed code. But I'll see if I can find any places in the runtime where this can happen too.