Allocating a big, long-lived array in the heap significantly reduces the performance of the Parallel and Concurrent (CMS) collectors. In Mustang the Parallel collector works fine, but CMS is still very slow.
See the reproduction source in the attachment. SerialGC is more than 10 times faster than CMS.
java -XX:+PrintGCDetails -Xmx768m -XX:+Use<GC> gctest.Main
###@###.### 2005-07-19 13:56:08 GMT
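The original attachment is not available, but the report describes one large, long-lived array plus allocation churn. A hypothetical sketch of that kind of reproducer might look like the following (the class name, array size, and iteration count are assumptions, not the actual test):

```java
// Hypothetical reproducer sketch: a single big, long-lived array kept alive
// while short-lived allocations drive repeated minor collections.
public class GCTest {
    static long[] bigArray;  // large object, stays live for the whole run

    public static void main(String[] args) {
        // One large allocation; with typical generation sizing this lands in
        // (or is quickly promoted to) the old generation.
        bigArray = new long[8 * 1024 * 1024];   // 64 MB of longs

        long churned = 0;
        for (int i = 0; i < 10_000; i++) {
            // Short-lived garbage forces frequent minor collections, whose
            // pause times expose the cost of scanning near the big object.
            byte[] junk = new byte[16 * 1024];
            churned += junk.length;
        }
        System.out.println("churned bytes: " + churned);
    }
}
```

Running this with -XX:+UseConcMarkSweepGC and then with -XX:+UseSerialGC (plus -XX:+PrintGCDetails) would be the way to compare the reported pause times.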
I've looked at the execution profiles for the VM
after the fix for the BOT was implemented, and
a large part of the remaining cost is in the promotion handling code.
Specifically, the code that links promoted objects together
in a list may need some work. Nothing there
looked like it related specifically to big objects, so this bug
is being moved to fix delivered.
The following fix was putback by Jon to gc_baseline on 2/17
and will integrate into Mustang b75 (nee b74):
Parent workspace: /net/jano.sfbay/export/disk05/hotspot/ws/main/gc_baseline
Child workspace: /net/prt-web.sfbay/prt-workspaces/20060217102834.jmasa.gc_baseline_6298694/workspace
Job ID: 20060217102834.jmasa.gc_baseline_6298694
Original workspace: arches:/net/karachi/bigtmp/jmasa/gc_baseline_6298694
Archived data: /net/prt-archiver.sfbay/data/archived_workspaces/main/gc_baseline/2006/20060217102834.jmasa.gc_baseline_6298694/
Partial 6298694: bad performance with big object in heap
This is partial because there may be more improvements to be
made for large objects.
Change the initialization of the block offset table to
use the logarithmic offsets. Line 98 in the webrev for blockOffsetTable.cpp
is the significant change. The rest of the changes are cleanup.
Removed some debugging code (old version of a method that was
being used for verification).
Renamed a parameter to fix_up_alloced_region() from "start_card"
to "first_card_to_fix" since "start_card" often is the first card
set for a newly allocated block. Clarified (hopefully) the
specification for fix_up_alloced_region().
Reviewed by: Ramki (partial), John, and Tony.
Approved for putback by: Dave C.
Fix verified (y/n): y
Ran the test program attached to the CR and noted a
decrease in the minor collection pause of about
two-thirds (from approximately 10s to approximately 3.5s).
runThese -quick -testbase_vm -testbase_gc with sparc product
and fastdebug builds
refworkload reference_server runs were done with sparc product.
Examined files: 3790
3788 no action (unchanged)
The root cause of this bug is that the block offset table (BOT) is initialized
for single card offsets in the constructor for BlockOffsetArray. This
does not lead to a correctness issue. The performance problem arises from
the free list allocation done by CMS. If a chunk is split in order to
do an allocation, it is assumed that the BOT for the original chunk is
correct and only the BOT for the remainder after the split is updated. In
some situations (for example, the first initialization of large chunks
out of the dictionary) this leaves the BOT using the single card offsets
instead of the logarithmic offsets. This probably eventually works itself
out but is particularly obvious in the test case for this problem.
The initialization of the BOT needs to be fixed. Since the contiguous space
version of the BOT will set the BOT for blocks as allocations move to
the right in the heap, initializing with logarithmic strides is probably
OK, but that should be verified.
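The cost difference between single-card and logarithmic offsets can be shown with a toy model. This is a deliberate simplification for illustration only, not the actual HotSpot BOT encoding: each entry here directly stores "how many cards back the block start is", whereas the real table uses a compact byte encoding.

```java
// Toy block-offset-table model (NOT the real HotSpot encoding).
// Walking back from the last card of one big block to its start card is
// linear with single-card entries, but logarithmic with power-of-two strides.
public class BotSketch {
    // Single-card initialization: every entry says "start is 1 card back".
    static int[] singleCard(int cards) {
        int[] bot = new int[cards];
        for (int i = 1; i < cards; i++) bot[i] = 1;
        return bot;
    }

    // Logarithmic strides: entry i skips back the largest power of two <= i,
    // which always lands on an earlier card of the same block.
    static int[] logStride(int cards) {
        int[] bot = new int[cards];
        for (int i = 1; i < cards; i++) bot[i] = Integer.highestOneBit(i);
        return bot;
    }

    // Follow back-skips until the block's start card (entry 0) is reached.
    static int stepsToStart(int[] bot, int card) {
        int steps = 0;
        while (bot[card] != 0) { card -= bot[card]; steps++; }
        return steps;
    }
}
```

For a block spanning 1024 cards, finding the start from the last card takes 1023 steps with single-card entries but only 10 with the logarithmic ones, which is why a BOT left in single-card form is so costly around a big object.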