There are a few ways to deal with this issue.
1. We can drain all available update buffers at the beginning of the cleanup pause, so there will be no buffers that point to empty regions after the cleanup pause. This will increase the duration of the cleanup pause though.
2. We can try to timestamp the update buffers and the regions as they are allocated, to prove that a region was allocated after a buffer was generated and, hence, that any cards on that region in said buffers should be ignored. We should be able to re-use the same mechanism for filtering out cards on young regions (which we have to explicitly check for today). I haven't fully thought through how this will work, but maybe this is something we can consider in the future.
3. Change the code that does the humongous allocation to set up the BOT before it updates top. That way, while the BOT is being set up, cards on the regions being allocated will be ignored, as they will reside over top. After the BOT is set up, we can then update top.
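To make the ordering in option 3 concrete, here is a minimal toy sketch. It is not HotSpot code: ToyRegion, object_start_for_card, allocate_humongous and the constants are hypothetical names made up for illustration, and the BOT is reduced to one int per card. It only shows that, if the BOT is written before top is published, a refinement thread that filters on top never reads a BOT entry that has not been set up.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Toy model only; every name here is hypothetical, not a HotSpot identifier.
constexpr std::size_t kCardsPerRegion = 8;
constexpr int kIgnored  = -1;  // card lies at or above top, so it is skipped
constexpr int kBotUnset = -2;  // BOT entry that has not been set up yet

struct ToyRegion {
    // Index of the first card above the allocated part of the region.
    std::atomic<std::size_t> top{0};
    // Simplified BOT: for each card, the card index where its object starts.
    std::vector<int> bot = std::vector<int>(kCardsPerRegion, kBotUnset);
};

// Refinement side: cards at or above top are ignored, so the BOT is only
// consulted for cards that the allocator has already published.
int object_start_for_card(const ToyRegion& r, std::size_t card) {
    if (card >= r.top.load(std::memory_order_acquire)) {
        return kIgnored;
    }
    return r.bot[card];
}

// Allocation side (option 3): fill in the BOT first, then update top.
void allocate_humongous(ToyRegion& r, std::size_t cards) {
    if (cards > kCardsPerRegion) {
        cards = kCardsPerRegion;  // keep the toy example in bounds
    }
    for (std::size_t i = 0; i < cards; ++i) {
        r.bot[i] = 0;  // the single humongous object starts at card 0
    }
    // Publishing top last means a reader that sees the new top also sees
    // the BOT writes above (release here pairs with acquire in the reader).
    r.top.store(cards, std::memory_order_release);
}
```

In this toy model a stale card on a freshly allocated region is either ignored (top not yet updated) or resolved through a fully initialized BOT entry (top updated). With the reverse order, a reader could observe the new top and still find kBotUnset.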
I like that both 1 and 2 will avoid processing "out of date" cards, which will make those solutions a bit more robust. On the other hand, 3 is probably the easiest and most localized change (it only touches the humongous object allocation code, which is a relatively infrequent operation). I will try 3 first.
Testing with extra instrumentation has revealed the smoking gun.
I added code at the end of the cleanup pause that goes through all the available update buffers (in the global queue and those currently in use by the Java threads) and checks whether there are any cards available that point to empty regions (hoping to catch the case where there are cards to be processed on regions we just freed up). Extra instrumentation also prints out the ranges of the regions we are freeing up during cleanup.
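For illustration, the check described above has roughly the following shape. This is a toy sketch under stated assumptions, not the actual instrumentation: buffers are modelled as plain vectors of heap offsets, regions as a vector of "is empty" flags, and StaleCard, kRegionBytes and find_cards_on_empty_regions are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model only; every name here is hypothetical, not a HotSpot identifier.
constexpr std::size_t kRegionBytes = 1024;  // assumed region size for the toy

struct StaleCard {
    std::size_t    buffer_index;  // buffer that holds the card
    std::uintptr_t card_addr;     // heap offset covered by the card
    std::size_t    region_index;  // empty region that the card falls in
};

// Walks every buffer (the global queue and the per-thread buffers are both
// just entries of `buffers` here) and reports the cards that land on a
// region currently marked as empty.
std::vector<StaleCard> find_cards_on_empty_regions(
        const std::vector<std::vector<std::uintptr_t>>& buffers,
        const std::vector<bool>& region_is_empty) {
    std::vector<StaleCard> stale;
    for (std::size_t b = 0; b < buffers.size(); ++b) {
        for (std::uintptr_t addr : buffers[b]) {
            std::size_t region = addr / kRegionBytes;
            if (region < region_is_empty.size() && region_is_empty[region]) {
                stale.push_back({b, addr, region});
            }
        }
    }
    return stale;
}
```

A non-empty result at the end of the cleanup pause corresponds to the condition being hunted for: a buffer still holds cards on a region that was just freed.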
It took more than half a day for pmd to fail with the extra instrumentation, but it did fail with the data I expected. We have a particular region that is freed during cleanup, there are buffers that contain cards on that region, and it's the region whose BOT we find to be corrupted while we're trying to allocate it as a humongous region shortly after the cleanup pause. I think that's enough proof that the scenario I outlined in the Description is the one that's causing the failure.