Copied from https://coherence.us.oracle.com/jira/browse/COH-7918
When using HotSpot JDKs with the fix for bug 7169050, SelectionService threads end up spinning at 100% CPU utilization (as per prstat and lockstat output). This problem was originally seen using JDK 8u0b53, but has also been seen on a private build of 7u4 with the fix, as well as on 7u7. The Coherence version under test is 12.1.2 build 37092. The situation occurs when using SDMB as well as when using TMB.
The problem does not occur if we use the new Selector in JDK 8 based on the Selector event port mechanism. This requires setting the following flag on the server JVMs:
It turns out that the problem can be triggered without applying a workload at all. Merely starting a Coherence cluster consisting of more than one node (no issue with one node) is sufficient for the problem to manifest itself.
Here is some brief prstat output, after the cluster is up:
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 64 mwm       66  34 0.0 0.0 0.0 0.0 0.0 0.0   1  44 .12   0 java/93
 64 mwm       65  34 0.0 0.0 0.0 0.0 0.0 0.0   2  44 .12   0 java/90
 64 mwm       70  27 0.0 0.0 0.0 0.0 3.7 0.0   3  44 .10   0 java/79
 64 mwm      0.4 0.3 0.0 0.0 0.0  99 0.6 0.0   8   0 117   0 java/2
 64 mwm      0.2 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   4   0 java/66
 64 mwm      0.2 0.0 0.0 0.0 0.0 100 0.0 0.0   1   0   4   0 java/67
where the first two threads are ExaBus (I/O) threads:
"SelectionService(MultiplexedSelector(sun.nio.ch.DevPollSelectorImpl@781bb763))" daemon prio=3 tid=0x0000000110411180 nid=0x5d runnable [0xffffffff5edfe000]
"SelectionService(MultiplexedSelector(sun.nio.ch.DevPollSelectorImpl@3319087d))" daemon prio=3 tid=0x00000001103bbe00 nid=0x5a runnable [0xffffffff5f3fe000]
and the third thread is:
"Cluster|Member(Id=1, Timestamp=2012-09-14 08:45:01.523, Address=188.8.131.52:8088, MachineId=16712, Location=site:,machine:nit1-ld2-ib,process:64, Role=server)" daemon prio=3 tid=0x00000001101129d0 nid=0x4f runnable [0xffffffff609ff000]
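For anyone matching the thread dump entries to the prstat rows: the nid values in the dump appear to be the LWP ids in hexadecimal (0x5d, 0x5a and 0x4f line up with java/93, java/90 and java/79). A minimal sketch of that conversion follows; the class and method names are made up for illustration and are not part of any tool.

```java
public class NidToLwp {

    /** Parses a thread-dump nid such as "0x5d" into the decimal LWP id shown by prstat. */
    static int toLwpId(String nid) {
        // Integer.decode understands the "0x" prefix used in the thread dump.
        return Integer.decode(nid);
    }

    public static void main(String[] args) {
        // nid values taken from the three thread dump entries above.
        for (String nid : new String[] {"0x5d", "0x5a", "0x4f"}) {
            System.out.println(nid + " -> java/" + toLwpId(nid));
        }
    }
}
```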
It seems [to me] that the data transfer that occurs when partitions are sent to new cluster members is sufficient to trigger the behavior.
A fair bit of discussion has gone on already on this issue (via email) involving me, Mark Falco, Joy Xiong (PAE, filed 7169050) and Alan Bateman (JDK dev., implemented 7169050). As a result, Mark has stated the following:
"I'm happy to assume for now that the issue is on the Coherence side. Since Jesse can easily reproduce this I think the most useful path will be for he and I to reproduce this in a debugger and step through to see if Coherence is doing something silly like registering write interest and then not writing. If we see that Coherence is being reasonable then the bug would appear to be at the JVM level or below, and we can hand this issue off. My availability is a bit lacking this week, so perhaps rather then a debug session Jesse could just collect a Java heap dump which I can analyze off-line and see if there is enough data there to identify if Coherence is misbehaving, and we can plan for a debugging session next week if the heap dump doesn't yield anything."
That heap dump has been collected and shared with Mark.
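To illustrate the kind of misbehavior Mark describes (registering write interest and then not writing), here is a small standalone sketch using plain java.nio. It is not Coherence code and makes no claim about what Coherence actually does; the class name WriteInterestSpin and the helper countWakeups are hypothetical. It assumes a loopback connection with an empty send buffer, in which case OP_WRITE is ready immediately and stays ready, so a select loop that never writes and never clears the interest should wake up continuously.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class WriteInterestSpin {

    /** Counts how often select() reports a ready key within the given time window. */
    static int countWakeups(boolean writeInterest, long windowMillis) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open();
             Selector selector = Selector.open()) {
            // Bind to an ephemeral loopback port so the sketch needs no configuration.
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            try (SocketChannel client = SocketChannel.open(server.getLocalAddress());
                 SocketChannel accepted = server.accept()) {
                client.configureBlocking(false);
                int ops = writeInterest ? SelectionKey.OP_WRITE : SelectionKey.OP_READ;
                client.register(selector, ops);

                int wakeups = 0;
                long deadline = System.nanoTime() + windowMillis * 1_000_000L;
                while (System.nanoTime() < deadline) {
                    // With OP_WRITE registered, the key is ready as long as the
                    // send buffer has room, so select() returns without blocking.
                    if (selector.select(10) > 0) {
                        wakeups++;
                        // Nothing is written and the interest is left in place.
                        selector.selectedKeys().clear();
                    }
                }
                return wakeups;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("OP_WRITE wakeups: " + countWakeups(true, 200));
        System.out.println("OP_READ wakeups:  " + countWakeups(false, 200));
    }
}
```

With write interest registered the loop should wake up many times in the window, while the read-interest run should just sit in select() until the timeout. If a heap dump or debugger session shows keys with OP_WRITE interest and nothing queued to write, that would point at this pattern; if not, the spin would have to come from elsewhere.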
Tests are being run on 2 T4-4 (split PCI-E) LDOMs, each running Solaris 11u1b13 (from 03-29-2012).