SUGGESTED FIX
http://analemma.sfbay/net/spot/scratch/ysr/mut/webrev
--------------------------
Fix put back to Mustang workspace:
Event: putback-to
Parent workspace: /net/jano.sfbay/export/disk05/hotspot/ws/main/gc_baseline
(jano.sfbay:/export/disk05/hotspot/ws/main/gc_baseline)
Child workspace: /prt-workspaces/20040722160055.ysr.dragon/workspace
(prt-web:/prt-workspaces/20040722160055.ysr.dragon/workspace)
User: ysr
Comment:
---------------------------------------------------------
Original workspace: neeraja:/net/spot/scratch/ysr/dragon
Submitter: ysr
Archived data: /net/prt-archiver.sfbay/export2/archived_workspaces/main/gc_baseline/2004/20040722160055.ysr.dragon/
Webrev: http://analemma.sfbay.sun.com/net/prt-archiver.sfbay/export2/archived_workspaces/main/gc_baseline/2004/20040722160055.ysr.dragon/workspace/webrevs/webrev-2004.07.22/index.html
Fixed 5037027: CMS: precleaning causes crash if perm gen collection enabled
http://analemma.sfbay/net/spot/scratch/ysr/mut/webrev
or
http://analemma.sfbay/net/spot/scratch/ysr/dragon
(for some reason, the sccs comments show up mangled in
this latter webrev; refer to the former webrev for
clean sccs comments.)
There was a coding bug in the precleaning loop in the method
preclean_mod_union_table(), which, with CMSPermGenPrecleaningEnabled
off, would clear mod-union-table entries for the perm gen
without actually precleaning the corresponding objects.
This can cause intra-generational oop-updates in the
perm gen to be ignored by the concurrent collector
and cause perm gen (and in rare cases other gen)
objects to be recycled prematurely.
The bug was masked until CMSPermGenPrecleaningEnabled
was switched off recently to work around bug 5040363.
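The failure mode described above can be illustrated with a toy model.
This is only a sketch of the general shape of such a bug, not the
HotSpot code: Card, preclean_buggy, preclean_fixed, lost_updates,
demo_lost and check_sketch are all hypothetical names, and the real
preclean_mod_union_table() works on bitmaps and memory regions rather
than on a vector of structs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: every type and name here is hypothetical, not HotSpot code.
struct Card {
    bool mutated = false;  // a mutator updated an oop on this card
    bool dirty = false;    // the mod-union-table entry for this card is set
    bool scanned = false;  // the objects on this card were precleaned
};

// Buggy shape: the entry is cleared unconditionally, but the objects are
// only precleaned when precleaning is enabled for this generation.
void preclean_buggy(std::vector<Card>& cards, bool precleaning_enabled) {
    for (Card& c : cards) {
        if (!c.dirty) continue;
        c.dirty = false;  // the record of the update is dropped here
        if (precleaning_enabled) c.scanned = true;
    }
}

// Fixed shape: an entry is cleared only if its card is actually precleaned;
// otherwise it stays set so that a later phase can still see it.
void preclean_fixed(std::vector<Card>& cards, bool precleaning_enabled) {
    if (!precleaning_enabled) return;
    for (Card& c : cards) {
        if (!c.dirty) continue;
        c.scanned = true;
        c.dirty = false;
    }
}

// Counts cards whose update can no longer be found by the collector.
std::size_t lost_updates(const std::vector<Card>& cards) {
    std::size_t lost = 0;
    for (const Card& c : cards) {
        if (c.mutated && !c.dirty && !c.scanned) ++lost;
    }
    return lost;
}

// Runs one preclean pass over three cards, one of which was mutated.
std::size_t demo_lost(bool use_fixed, bool precleaning_enabled) {
    std::vector<Card> cards(3);
    cards[1].mutated = true;
    cards[1].dirty = true;
    if (use_fixed) {
        preclean_fixed(cards, precleaning_enabled);
    } else {
        preclean_buggy(cards, precleaning_enabled);
    }
    return lost_updates(cards);
}

// Self-check: only the buggy loop with precleaning off loses the update.
void check_sketch() {
    assert(demo_lost(false, false) == 1);
    assert(demo_lost(false, true) == 0);
    assert(demo_lost(true, false) == 0);
    assert(demo_lost(true, true) == 0);
}
```

In this model the update is lost only when the buggy loop runs with
precleaning disabled, which matches the observation that the bug stayed
masked while CMSPermGenPrecleaningEnabled was on.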
Reviewed by: jmasa, pbk (some cleanups suggested by pbk deferred)
Fix Verified: yes
Verification testing:
In Tiger:
---------
runThese with -server -XX:+ShowMessageBoxOnError -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:-CMSPermGenPrecleaningEnabled -XX:+CMSPrecleaningEnabled -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=2 -XX:+DisableExplicitGC
In Mustang:
-----------
runThese with -server -XX:+ShowMessageBoxOnError -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:-CMSPermGenPrecleaningEnabled -XX:+CMSPrecleaningEnabled -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled -XX:ExplicitGCInvokesConcurrent
Other Testing: CMS (with & without above "stress" option list)
spec, PRT, refWorkload, runThese, cloudscape,
HP's class unloading test
Files:
update: src/share/vm/memory/concurrentMarkSweepGeneration.cpp
update: src/share/vm/memory/concurrentMarkSweepGeneration.hpp
Examined files: 3222
Contents Summary:
2 update
3220 no action (unchanged)
-----------------------------------------
Fix also put back to Tiger:
Job submitted at: 10:01:38 AM
Total job time: 1h 48m 49s
Job state: success
Job fail/kill comment: NoComment
Job flags: PUTBACK ARCHIVE SYNC-WORKSPACE
Original workspace: neeraja:/net/spot/scratch/ysr/mut
Parent workspace: /net/jano.sfbay/export/disk05/hotspot/ws/1.5/tiger_baseline
Submitter: ysr
PRT data: /net/prt-web.sfbay/prt-workspaces/20040726100038.ysr.mut
Archived data: ERROR, no archive file generated
Webrev: No webrev was generated
Fixed 5037027: CMS: precleaning causes crash if perm gen collection enabled
http://analemma.sfbay/net/spot/scratch/ysr/mut/webrev
There was a coding bug in the precleaning loop in the method
preclean_mod_union_table(), which, with CMSPermGenPrecleaningEnabled
off, would clear mod-union-table entries for the perm gen
without actually precleaning the corresponding objects.
This can cause intra-generational oop-updates in the
perm gen to be ignored by the concurrent collector
and cause perm gen (and in rare cases other gen)
objects to be recycled prematurely.
The bug was masked until CMSPermGenPrecleaningEnabled
was switched off recently to work around bug 5040363.
Thanks to June for demonstrating, using ATG and IMM/S1AS, that
CMS/GC during start-up (when perm gen mutation rates are extremely
high, thus increasing exposure to this bug) would expose the customers
to this bug -- evidence that convinced the core team to approve
this bug for Tiger after two initial rejections.
Thanks to Francis Hsu for turning around the requisite PIT
tests at short notice; and to Alan Bateman for interpreting
some results.
Thanks also to the Portal Server team (Russ Petruzzelli and
Young Kwon) for making available test machines for running some
load tests (which however did not exhibit this bug).
Reviewed by: jmasa, pbk (some cleanups suggested by pbk deferred)
Approved by: Server Core Team
Fix Verified: yes
Verification testing:
In Tiger:
---------
. runThese with -server -XX:+ShowMessageBoxOnError -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:-CMSPermGenPrecleaningEnabled -XX:+CMSPrecleaningEnabled -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=2 -XX:+DisableExplicitGC
. Big apps testing by June: (with CMSClassUnloadingEnabled CMSPermGenSweepingEnabled)
IMM/S1AS
ATG
In Mustang:
-----------
runThese with -server -XX:+ShowMessageBoxOnError -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC -XX:-CMSPermGenPrecleaningEnabled -XX:+CMSPrecleaningEnabled -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled -XX:ExplicitGCInvokesConcurrent
Other Testing: CMS (with & without above "stress" option list)
spec, PRT, refWorkload, runThese, cloudscape,
HP's class unloading test
Big apps testing by June: IMM/S1AS, ATG
PIT testing by Francis Hsu
WORK AROUND
The bug needs the following set of conditions to manifest:
(1) CMSPermGenPrecleaningEnabled is false (this is the default in
1.5 and 1.4.2_05)
(2) CMSPrecleaningEnabled is true (this is the default)
(3) CMSPermGenSweepingEnabled and CMSClassUnloadingEnabled are true
(these are _not_ the defaults in Tiger or earlier)
(4) There is a scavenge during the CMS concurrent marking phase
(this will usually be the case for all but the smallest
old gen's)
(5) There is no concurrent mode failure before the end of the
CMS remark phase
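The five conditions above can be written as a single predicate. This is
only an illustrative sketch: the CmsCycle struct, its two event fields,
and the functions all_conditions_hold and check_conditions are
hypothetical names, and only the four flag names come from this report.
It encodes the list exactly as stated and should be read as an exposure
check, not as a guarantee that a crash will or will not occur.

```cpp
#include <cassert>

// Hypothetical snapshot of the flags and events named in the list above.
struct CmsCycle {
    bool CMSPermGenPrecleaningEnabled;
    bool CMSPrecleaningEnabled;
    bool CMSPermGenSweepingEnabled;
    bool CMSClassUnloadingEnabled;
    bool scavenge_during_concurrent_mark;            // condition (4)
    bool concurrent_mode_failure_before_remark_end;  // negated in (5)
};

// True when all five listed conditions hold for this cycle.
bool all_conditions_hold(const CmsCycle& c) {
    return !c.CMSPermGenPrecleaningEnabled              // (1)
        && c.CMSPrecleaningEnabled                      // (2)
        && c.CMSPermGenSweepingEnabled                  // (3)
        && c.CMSClassUnloadingEnabled                   // (3)
        && c.scavenge_during_concurrent_mark            // (4)
        && !c.concurrent_mode_failure_before_remark_end; // (5)
}

// Self-check: flipping any one listed condition clears the predicate.
void check_conditions() {
    CmsCycle exposed{false, true, true, true, true, false};
    assert(all_conditions_hold(exposed));

    CmsCycle no_preclean = exposed;
    no_preclean.CMSPrecleaningEnabled = false;
    assert(!all_conditions_hold(no_preclean));

    CmsCycle no_perm_collection = exposed;
    no_perm_collection.CMSPermGenSweepingEnabled = false;
    no_perm_collection.CMSClassUnloadingEnabled = false;
    assert(!all_conditions_hold(no_perm_collection));
}
```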
The only known workarounds are:
(1) to switch off all precleaning: -XX:-CMSPrecleaningEnabled
But that workaround is not practically viable because it
would make CMS remark pauses very long and thus usually almost
completely defeat CMS' primary purpose;
OR (* see Note in (2) below)
(2) to switch off perm gen collection: -XX:-CMSPermGenSweepingEnabled
-XX:-CMSClassUnloadingEnabled
[* NOTE: this would very greatly reduce, but not completely
eliminate, the risk of a crash.]
But that again is not practically viable at least for applications
that have no bound on Perm space allocation (i.e. apps that
always load new classes), since that would make the occasional
full collection inevitable, which would blow the GC pause times
just like above.