FULL PRODUCT VERSION :
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
vm_info: Java HotSpot(TM) 64-Bit Server VM (21.0-b17) for solaris-amd64 JRE (1.7.0-b147), built on Jun 27 2011 02:29:28 by "" with Sun Studio 12u1
FULL OS VERSION :
OS: Oracle Solaris 11 Express snv_146 X86
Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved.
Assembled 12 August 2010
uname:SunOS 5.11 snv_146 i86pc (T2 libthread)
rlimit: STACK 10240k, CORE infinity, NOFILE 65536, AS infinity
load average:1.55 0.76 0.68
CPU:total 24 (6 cores per cpu, 1 threads per core) family 16 model 8 stepping 0, cmov, cx8, fxsr, mmx, sse, sse2, sse3, popcnt, mmxext, 3dnowpref, lzcnt, sse4a
Memory: 4k page, physical 268434660k(205784976k free)
EXTRA RELEVANT SYSTEM CONFIGURATION :
Oracle Sun X4440 with Quad-HexCore (24Cores Total) 128GB DRAM.
1x 1Gb Ethernet Port "igb0" utilized for Network Traffic.
Solaris 11 Build 146 (Just slightly newer than Solaris 11 Express)
Java 7 Build 147
A DESCRIPTION OF THE PROBLEM :
Attempts to balance Eden vs. Old vs. Perm vs. StackSize vs. Runtime.exec() vs. DirectMemory
allocation sizes and avoid (a)Full GC DenialOfService, (b)Perm tagging Temporary Data,
(c)15-250 Active Thread Counts, (d)Avoid OutOfMemory due to 32Bit JVM Barriers, and
(e)Asynchronous I/O Workloads is resulting in GC oops under
"HeapWord*ParallelScavengeHeap::failed_mem_allocate(unsigned long,bool)+0x97".
Newly changed are GC Memory optimizations, please let me know if these are wrong and/or
if I could assist by reproducing with other settings. :)
-d64 (Required to avoid "Runtime.exec()" OOME failures and greater than -Xmx3550m support)
-Xms22000m (=Xmx ~10% DRAM - How large does Java7 GC scale stably?)
-Xmn21984m (=(Xmx - 16M) attempting to avoid Full GC DenialOfService : Newly added option)
-Xmx22000m (=Xms to avoid OutOfMemoryError with less than Mmx Heap allocated...different issue)
-Xss256k (to work with d64 option)
-XX:MaxPermSize=16m (enough for JVM, Selectors, ScheduledThreadPool : Newly added option)
-XX:MaxDirectMemorySize=1536m (Required for AsynchronousFileChannel I/O Workloads : Newly added option)
-XX:+UseNUMA (10-30% Performance Boost due to Quad-HexCore physical layout : Newly added option)
THE PROBLEM WAS REPRODUCIBLE WITH -Xint FLAG: Did not try
THE PROBLEM WAS REPRODUCIBLE WITH -server FLAG: Did not try
REGRESSION. Last worked in version 7
STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Reproduction (reproduced 8 times in last month):
1. Solaris 11 Build 146, Java7 Build 147, and ABT 9.2 Build 2011.10.20.17.17
# pkgadd -d /net/net0.us.oracle.com/export/auto/proto/jbuckley_q3fy11_3/packages/i386/SUNWstc-abt.pkg all
# ln -s /net/net0.us.oracle.com/export/auto/action /opt/SUNWstc-abt/lib/action
# svcadm restart abt ( may be needed to wakeup smf :)
2. Start IO Workload (below specific to test rig reproducing all 12 cases):
# /home/jbuckley/startIO
... essentially starts the following:
a. Two Selector Threads managing ServerSocketChannel(s), SocketChannel(s), and
DatagramChannel(s) (JavaTIDs 8 & 9, Native ThreadID=29&30) - Minimal Memory Use.
b. ServerSocketChannel(s) and Bond DatagramChannels for TCP/IP & UDP/IP for
ABT, ECHO, DISCARD, DAYTIME, SMTP, TIME, SNMP, SYSLOG Services. These are completely
managed via two Selector threads, above. All incoming SMTP, SNMP, Syslog messages are
appended to ACTIVE Actions, below.
c. Two ScheduledThreadPoolExecutor Threads managing LowMemory check (calling System.gc()
at 95% heap usage) and new Action Interpretation into Action Trees (below).
(JavaTIDs 11&12, Native ThreadIDs=31&32) - Minimal Memory Use.
d. Several Thousand Actions (50-150 at any time) are cycled through ExecutorCompletionService
providing the following cradle-to-grave life cycle states seen in Thread Names - Large variations
in memory use as Actions are generated, archived, and recycled - Mostly (90%) Short Lived Data
on the order of 5 minute life cycle - Some (9%) Medium Lived Data on the order of 1-5 hours life
cycle - Rare (1%) Longer Lived Data on the order of 1-5 days life cycle - NO (0%) Perm Data:
INIT : Allocate Action Memory
SUBMIT : Allocate Thread from CompletionService
ACTIVE : Conduct Testing, potentially locally via "RunTime.exec(...)" or remotely via Socket Transfers
REMOTE : Optionally Transfer Testing to Remote Host
CANCEL : Optionally Abort Testing
RETURN : Optionally Transfer Testing from Remote Host
REUSE : ResubmitAction/EmailLogs/etc.
ARCHIVE : Persist Memory based logs to disk
REDUCE : Reduce Memory usage to miminal for parent Action ARCHIVE to assist with GC
REPORT : Wait for Parent Archive Completion
RECYCLE : Final Action Shreading to assist with GC
IDLE : Reallocation of Thread to Pool and "if (ACTIONS.count() < 10) Runtime.getRuntime().gc();"
d. /usr/jdk/jdk1.7.0/bin/jconsole & used to monitor JVM. - Minimal Memory Use.
e. Actions interact with 2 Linux Clients running ABT, 1 Solaris 11 Client running ABT and a
ZFSSA/7000 Cluster to do several configuration/load/stress tests.
I have moved the old hs_err_pid####.log files to here:
http://net0.us.oracle.com/shares/export/auto/reports/auto4440-05/
EXPECTED VERSUS ACTUAL BEHAVIOR :
Behavior is seen as the following:
1. Initial/Expected behavior is stable.
a. Without Actions, JVM has minimal heap and perm memory easily fits in <<16MB.
b. With Actions, JVM has varying between 68MB to 16GB without issues as workloads varies
from more serial execution to more concurrent execution.
2. Near Death behavior becomes GC Denial-Of-Service.
a. GC (Native Thread 21) takes increasing amount of cycles - 10, 40, 80, 100% active in prstat -maL
b. Eventually, GC (Native TID 21) takes 100% activity and all other threads go 0% flatline.
3. Death behavior has some common themes.
a. HeapWord*ParallelScavengeHeap::failed_mem_allocate(unsigned long,bool)+0x97
b. "lgrp # space ####K, 100% used" or "PSOldGen total 16384K, used 16384K"
ERROR MESSAGES/STACK TRACES THAT OCCUR :
# /opt/SUNWstc-abt/bin/hs_err_pid22967.log
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xfffffd7ffd4044d4, pid=22967, tid=21
#
# JRE version: 7.0-b147
# Java VM: Java HotSpot(TM) 64-Bit Server VM (21.0-b17 mixed mode solaris-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x3b44d4] void PSPromotionFailedClosure::do_object(oopDesc*)+0x8
#
# Core dump written. Default location: /opt/SUNWstc-abt/bin/core or core.22967
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
--------------- T H R E A D ---------------
Current thread (0x0000000001856000): VMThread [stack: 0xfffffd7ff6636000,0xfffffd7ff6736000] [id=21]
siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x0000000262400000
Registers:
RAX=0x0000000000000000, RBX=0xfffffd7ffe2323f0, RCX=0xfffffd7ffe22baa0, RDX=0x0000000000000018
RSP=0xfffffd7ff6735570, RBP=0xfffffd7ff6735580, RSI=0x0000000262400000, RDI=0xfffffd7ff6735670
R8 =0x0000000000411410, R9 =0xfffffd7ff67358a0, R10=0x0000000000000010, R11=0xfffffffffbc05970
R12=0xfffffd7ffe218a3c, R13=0x0000000262400000, R14=0xfffffd7ff6735670, R15=0x000000000041ed20
RIP=0xfffffd7ffd4044d4, RFLAGS=0x0000000000010282
Top of Stack: (sp=0xfffffd7ff6735570)
0xfffffd7ff6735570: fffffd7ff9f8e710 0000000000000005
0xfffffd7ff6735580: fffffd7ff67355d0 fffffd7ffd5c8089
0xfffffd7ff6735590: fffffd7ff67355b0 fffffd7ffe210694
0xfffffd7ff67355a0: fffffd7ff6735670 00000000023d62b0
0xfffffd7ff67355b0: 0000000000000012 fffffd7ffe22b9c0
0xfffffd7ff67355c0: 000000000041eb90 fffffd7ff6735670
0xfffffd7ff67355d0: fffffd7ff67355f0 fffffd7ffd5ed7f6
0xfffffd7ff67355e0: 000000000041e940 000000000041eb90
0xfffffd7ff67355f0: fffffd7ff67356b0 fffffd7ffd5ebf7f
0xfffffd7ff6735600: 00000007ee25fbe0 00000000023d3b00
0xfffffd7ff6735610: 00000000023d6310 00000000023dbae8
0xfffffd7ff6735620: fffffd7ff9ee69e8 fffffd7ff9ee6820
0xfffffd7ff6735630: 00000007f8602ec0 0000000000000001
0xfffffd7ff6735640: fffffd7ff9ee69e8 fffffd7ffe22bcc0
0xfffffd7ff6735650: 0000000000000012 fffffd7ffe211150
0xfffffd7ff6735660: 000000000054bf50 0000000001754890
0xfffffd7ff6735670: fffffd7ffe22baa0 fffffd7ffd4f0b00
0xfffffd7ff6735680: 000000000041e940 00000000023d62b0
0xfffffd7ff6735690: 0000000000000012 fffffd7ffe22b9c0
0xfffffd7ff67356a0: 000000000041e940 0000000000000000
0xfffffd7ff67356b0: fffffd7ff6735960 fffffd7ffded17b9
0xfffffd7ff67356c0: 000000000046cb48 000000017e9a18d0
0xfffffd7ff67356d0: 0000000667e61d78 000000000041eb90
0xfffffd7ff67356e0: 000000000041ec30 0000000000421160
0xfffffd7ff67356f0: 000000000041ddf0 fffffd7ffe22a040
0xfffffd7ff6735700: 000000000054bf50 000000000046c760
0xfffffd7ff6735710: 000000000046c770 fffffd7ffe211150
0xfffffd7ff6735720: 000000017d9a18f0 0000000000000001
0xfffffd7ff6735730: 00000002a1ffffe0 00000000001ffffc
0xfffffd7ff6735740: 000000010000000a 000002ec4928353a
0xfffffd7ff6735750: 000002ec66f28949 0000000000000000
0xfffffd7ff6735760: 000000000041e9d0 0000000000000001
Instructions: (pc=0xfffffd7ffd4044d4)
0xfffffd7ffd4044b4: 20 d9 00 48 8b 00 48 03 98 90 00 00 00 c6 03 11
0xfffffd7ffd4044c4: 41 5c 5b c9 c3 00 00 00 55 48 8b ec 48 83 ec 10
0xfffffd7ffd4044d4: 8b 06 48 83 e0 03 48 83 f8 03 75 37 48 8b 05 01
0xfffffd7ffd4044e4: fc d8 00 80 38 00 75 06 48 8b 46 08 eb 13 48 8b
Register to memory mapping:
RAX=0x0000000000000000 is an unknown value
RBX=0xfffffd7ffe2323f0: __1cIUniverseP_methodKlassObj_+0x78 in /usr/jdk/jdk1.7.0/jre/lib/amd64/server/libjvm.so at 0xfffffd7ffd050000
RCX=0xfffffd7ffe22baa0: __1cHnmethodG__vtbl_+0x5bd0 in /usr/jdk/jdk1.7.0/jre/lib/amd64/server/libjvm.so at 0xfffffd7ffd050000
RDX=0x0000000000000018 is an unknown value
RSP=0xfffffd7ff6735570 is an unknown value
RBP=0xfffffd7ff6735580 is an unknown value
RSI=0x0000000262400000 is an unknown value
RDI=0xfffffd7ff6735670 is an unknown value
R8 =0x0000000000411410 is an unknown value
R9 =0xfffffd7ff67358a0 is an unknown value
R10=0x0000000000000010 is an unknown value
R11=0xfffffffffbc05970 is an unknown value
R12=0xfffffd7ffe218a3c: __1cJCodeCacheF_heap_+0x13f3c in /usr/jdk/jdk1.7.0/jre/lib/amd64/server/libjvm.so at 0xfffffd7ffd050000
R13=0x0000000262400000 is an unknown value
R14=0xfffffd7ff6735670 is an unknown value
R15=0x000000000041ed20 is an unknown value
Stack: [0xfffffd7ff6636000,0xfffffd7ff6736000], sp=0xfffffd7ff6735570, free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x3b44d4] void PSPromotionFailedClosure::do_object(oopDesc*)+0x8
V [libjvm.so+0x578089] void MutableSpace::object_iterate(ObjectClosure*)+0x4d
V [libjvm.so+0x59d7f6] void PSYoungGen::object_iterate(ObjectClosure*)+0x1e
V [libjvm.so+0x59bf7f] void PSScavenge::clean_up_failed_promotion()+0xab
V [libjvm.so+0xe817b9] bool PSScavenge::invoke_no_policy()+0x1369
V [libjvm.so+0xe802f7] void PSScavenge::invoke()+0x3b
V [libjvm.so+0xe48433] HeapWord*ParallelScavengeHeap::failed_mem_allocate(unsigned long,bool)+0x97
V [libjvm.so+0x49bd33] void VM_ParallelGCFailedAllocation::doit()+0x93
V [libjvm.so+0x49b795] void VM_Operation::evaluate()+0x79
V [libjvm.so+0x621dc8] void VMThread::run()+0x548
V [libjvm.so+0xe22fc9] java_start+0x4a9
C [libc.so.1+0x112b94] _thrp_setup+0xbc
C [libc.so.1+0x112e50] _lwp_start+0x0
VM_Operation (0xfffffd7ff2816e50): ParallelGCFailedAllocation, mode: safepoint, requested by thread 0x00000000018cb000
--------------- P R O C E S S ---------------
Java Threads: ( => current thread )
0x000000000258c800 JavaThread "process reaper" daemon [_thread_in_native, id=1048, stack(0xfffffd7ff248b000,0xfffffd7ff24cb000)]
0x00000000019bc000 JavaThread "process reaper" daemon [_thread_in_native, id=1047, stack(0xfffffd7ff2179000,0xfffffd7ff21b9000)]
0x0000000002588800 JavaThread "Idle Thread" [_thread_blocked, id=1044, stack(0xfffffd7ff2138000,0xfffffd7ff2178000)]
0x0000000007556000 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0941.DelayBeforePingAttempts Test Ready" [_thread_blocked, id=1043, stack(0xfffffd7ff2652000,0xfffffd7ff2692000)]
0x0000000003807000 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0940.TimeRebootToPing Test Ready" [_thread_blocked, id=1042, stack(0xfffffd7ff223c000,0xfffffd7ff227c000)]
0x000000000766c800 JavaThread "STATE_REDUCE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0932.Unmount7000Shares Test Archive" [_thread_blocked, id=1036, stack(0xfffffd7ff21ba000,0xfffffd7ff21fa000)]
0x0000000001d50800 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0882.ManageSsh_ExecuteAkshScript OutputTransfer Seeking" [_thread_in_native, id=1035, stack(0xfffffd7ff21fb000,0xfffffd7ff223b000)]
0x0000000003702800 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0882.ManageSsh_ExecuteAkshScript Test Command
"ssh -t -o StrictHostKeyChecking=no root@auto7320-07"" [_thread_blocked, id=1033, stack(0xfffffd7ff23c8000,0xfffffd7ff2408000)]
0x00000000071a8000 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0948.Destroy7000EcmaClusterFS Test Ready" [_thread_blocked, id=1032, stack(0xfffffd7ff227d000,0xfffffd7ff22bd000)]
0x000000000754f000 JavaThread "STATE_REDUCE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0931.Unmount7000Shares Test Archive" [_thread_blocked, id=1031, stack(0xfffffd7ff22be000,0xfffffd7ff22fe000)]
0x0000000003708000 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0956.Destroy7000EcmaClusterLun
Test Ready" [_thread_blocked, id=1029, stack(0xfffffd7ff2409000,0xfffffd7ff2449000)]
0x00000000034cf800 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0955.Destroy7000EcmaClusterLun
Test Ready" [_thread_blocked, id=1028, stack(0xfffffd7ff244a000,0xfffffd7ff248a000)]
0x00000000034d1000 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0950.CheckOptions Test Ready" [_thread_blocked, id=1026, stack(0xfffffd7ff24cc000,0xfffffd7ff250c000)]
0x0000000001dee000 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0882.ManageSsh_ExecuteAkshScript OutputTransfer Seeking" [_thread_in_native, id=1025, stack(0xfffffd7ff258f000,0xfffffd7ff25cf000)]
0x0000000007444800 JavaThread "STATE_REDUCE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0930.Unmount7000Shares Test Archive" [_thread_blocked, id=1024, stack(0xfffffd7ff25d0000,0xfffffd7ff2610000)]
0x0000000001a55000 JavaThread "RMI TCP Connection(18)-10.80.205.160" daemon [_thread_blocked, id=1023, stack(0xfffffd7ff2611000,0xfffffd7ff2651000)]
0x0000000001eee800 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/1109.CheckOptions Test Ready" [_thread_blocked, id=1021, stack(0xfffffd7ff2693000,0xfffffd7ff26d3000)]
0x00000000036f1000 JavaThread "STATE_ACTIVE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/1101.CheckOptions Test Ready" [_thread_blocked, id=1020, stack(0xfffffd7ff26d4000,0xfffffd7ff2714000)]
0x0000000002ef6000 JavaThread "STATE_REDUCE :Verify7000TestSuiteWrapper.111020.172333.auto4440-05.22967.10.9/0881.ManageSsh_VerifySshable Test Archive" [_thread_blocked, id=1019, stack(0xfffffd7ff2715000,0xfffffd7ff2755000)]
0x00000000025c0800 JavaThread "Idle Thread" [_th
( This report has more than 16,000 characters and has been truncated. )
|