G1 shows poor load balancing and high contention during RSet scanning on CMT machines as well as on many-core x86 systems. The previously implemented exponential skipping performs unacceptably at extremely large processor counts.
The fix would be to implement block-based work stealing in the RSet scanning phase. Copying also has to be removed from this phase and postponed until the main copying phase. This would eliminate the buffering that significantly contributes to the observed imbalance.
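
As a rough illustration of the block-based idea only, and not the HotSpot implementation, the following standalone C++ sketch lets each worker claim fixed-size blocks of entries from a shared per-region cursor with a single atomic add. All names here (RegionWork, claim_block, scan_worker, scan_all) and the block size are hypothetical. The "scan" just counts entries, standing in for recording references so that copying can happen in a later phase.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the remembered-set entries of one region.
struct RegionWork {
    std::vector<int> entries;          // placeholder for RSet entries (e.g. cards)
    std::atomic<std::size_t> next{0};  // index of the next unclaimed entry

    // Claim the next block of up to block_size entries as [begin, end).
    // Returns false once the region has no unclaimed entries left.
    bool claim_block(std::size_t block_size, std::size_t& begin, std::size_t& end) {
        std::size_t start = next.fetch_add(block_size, std::memory_order_relaxed);
        if (start >= entries.size()) return false;
        begin = start;
        end = std::min(start + block_size, entries.size());
        return true;
    }
};

// Each worker starts at a different region and then moves on to the others,
// taking whatever blocks are still unclaimed. Nothing is copied here.
std::size_t scan_worker(std::vector<RegionWork>& regions,
                        std::size_t worker_id, std::size_t block_size) {
    std::size_t scanned = 0;
    const std::size_t n = regions.size();
    for (std::size_t i = 0; i < n; ++i) {
        RegionWork& r = regions[(worker_id + i) % n];
        std::size_t b = 0, e = 0;
        while (r.claim_block(block_size, b, e)) {
            scanned += e - b;  // placeholder for scanning entries [b, e)
        }
    }
    return scanned;
}

// Run num_workers threads and return the total number of entries scanned.
std::size_t scan_all(std::vector<RegionWork>& regions,
                     std::size_t num_workers, std::size_t block_size) {
    std::vector<std::size_t> counts(num_workers, 0);
    std::vector<std::thread> threads;
    for (std::size_t w = 0; w < num_workers; ++w) {
        threads.emplace_back([&, w] { counts[w] = scan_worker(regions, w, block_size); });
    }
    for (auto& t : threads) t.join();
    std::size_t total = 0;
    for (std::size_t c : counts) total += c;
    return total;
}
```

In this sketch a large region is split across all workers that reach it, so one oversized remembered set no longer pins a single thread, and the only shared write per block is one atomic add. The block size trades contention on the cursor against balance at the tail of the phase; the right value would have to be measured.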