Java Solaris Communities Sun Store Join SDN My Profile Why Join?
 
Bug Database
Bug Detail
Quick Lists
Top 25 Bugs
Top 25 RFE's
Recently Closed Bugs
Printable Page Printable Page


Bug Database
Bug ID: 6671051
Votes 0
Synopsis (process) Runtime.exec() hangs if signalled during fork/exec
Category java:classes_lang
Reported Against
Release Fixed 7(b25), 6-open(b07) (Bug ID:2159936) , 6u7(b01) (Bug ID:2160173)
State 10-Fix Delivered, bug
Priority: 2-High
Related Bugs 6474073 , 6667180 , 6678286 , 4811767 , 4052517
Submit Date 04-MAR-2008
Description
We are seeing threads hung in java.lang.Runtime.exec() on Solaris 10. The child process is created, but the Runtime.exec() call never returns.

In this example the calling thread is 103(0x67) and the child pid is 14159(0x374f) - I didn't capture the output that shows that.

Here is the backtrace (from jstack). 

"CacheManagerTaskExecutor-15" prio=10 tid=0x0840d400 nid=0x67 sleeping[0xb48a7000..0xb48a7b70]
   java.lang.Thread.State: RUNNABLE
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
        at java.lang.Runtime.exec(Runtime.java:593)
        at java.lang.Runtime.exec(Runtime.java:431)
        at java.lang.Runtime.exec(Runtime.java:328)
        at com.sun.hss.domain.util.misc.ExpectCommandExecutor.execute(ExpectCommandExecutor.java:48)
        - locked <0xbc1e72d0> (a com.sun.hss.domain.util.misc.ExpectCommandExecutor)
        at com.sun.hss.domain.util.ipmi.IPMICommand.exec(IPMICommand.java:168)
        at com.sun.hss.domain.util.ipmi.IPMICommand.exec(IPMICommand.java:145)
        at com.sun.hss.domain.driver.generic.GenericServerMXBeanHelper.execIPMICommand(GenericServerMXBeanHelper.java:406)
        at com.sun.hss.domain.util.ipmi.AbstractIPMICommandHelper.getFruMap(AbstractIPMICommandHelper.java:279)
        at com.sun.hss.domain.driver.v20z.V20zServerMXBeanHelper.getSerialNumber(V20zServerMXBeanHelper.java:211)
        at com.sun.hss.domain.driver.generic.AbstractServerMXBean.getSerialNumber(AbstractServerMXBean.java:87)
        at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:167)
        at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:96)
        at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:33)
        at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
        at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:65)
        at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:216)
        at javax.management.StandardMBean.getAttribute(StandardMBean.java:358)
        at com.sun.hss.domain.AbstractResourceMXBean.getAttribute(AbstractResourceMXBean.java:209)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:666)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
        at com.sun.jdmk.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:178)
        at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178)
        at com.sun.hss.domain.ConcurrentMBeanServerInterceptor.getAttribute(ConcurrentMBeanServerInterceptor.java:837)
        at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178)
        at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178)
        at com.sun.cacao.agent.DispatchInterceptor.getAttribute(DispatchInterceptor.java:506)
        at com.sun.cacao.agent.auth.impl.AccessControlInterceptor.getAttribute(AccessControlInterceptor.java:310)
        at com.sun.jdmk.JdmkMBeanServerImpl.getAttribute(JdmkMBeanServerImpl.java:640)
        at com.sun.cacao.cached.impl.server.ServerCachedMBeanProxy.rereadAttribute(ServerCachedMBeanProxy.java:326)
        at com.sun.cacao.cached.impl.server.ServerCachedMBeanProxy.refreshCachedAttributes(ServerCachedMBeanProxy.java:295)
        at com.sun.cacao.cached.impl.server.RefreshAttributeTask.execute(RefreshAttributeTask.java:60)
        at com.sun.cacao.cached.impl.server.TaskPool$TaskExecutor.run(TaskPool.java:334)


pstack shows the following for the thread (note that the 374f parameter corresponds to the child in question):
-----------------  lwp# 103 / thread# 103  --------------------
 fef30ee7 waitid   (0, 374f, b48a6fc0, 3)
 fef25b52 waitpid  (374f, 0, 0) + 75
 fe488242 Java_java_lang_UNIXProcess_forkAndExec (840d4ec, b48a7170, b48a716c, b48a7168, 7, 0) + 6ca
 fb40a598 ???????? (0, fb408219, bc1e7260, bc1e7250, bc1e7240, 0)
 fb402de9 ???????? (0, 0, 0, 0, 7, bc1e71f0)
 fb7f62b0 ???????? ()

This appears to be stuck in the following code in Java_java_lang_UNIXProcess_forkAndExec():
    close(fail[1]); fail[1] = -1; /* See: WhyCantJohnnyExec */
    if (read(fail[0], &errnum, sizeof(errnum)) != 0) {
        waitpid(resultPid, NULL, 0);
        throwIOException(env, errnum, "Exec failed");
        goto Catch;


Although this is highly reproducible in our lab stress tests, it was hard to produce when running truss. However, in this case I was able to capture it with truss -l -t read,waitid. Here is the truss output from the time of the failure (note the lwp id is 103, consistent with the above):
/103:         siginfo: SIGJVM1 pid=13520 uid=0 code=-1
/103:    0.0024 read(309, 0xB48A70B4, 4)                        Err#4 EINTR
/103:   waitid(P_PID, 14159, 0xB48A6FC0, WEXITED|WTRAPPED) (sleeping...)

It looks like there is a bug in the code above. It assumes that a non-zero return from the read system call represents a successful read of errnum, while in this case, the return value is actually -1. This causes the calling thread to wait for the child process to complete while, in our example, the child process has successfully exec'ed and is actually waiting to communicate with the hung thread upon its return.
Posted Date : 2008-03-04 18:39:07.0
Work Around
N/A
Evaluation
The read in question from a private pipe assumes either success or EOF,
but an EINTR is not impossible, and so should be handled by retrying the operation.
Posted Date : 2008-03-04 22:16:41.0

Here is my (untested) attempt to get the logic completely right.
Does programming always have to be so hard?

diff --git a/src/solaris/native/java/lang/UNIXProcess_md.c b/src/solaris/native/java/lang/UNIXProcess_md.c
--- a/src/solaris/native/java/lang/UNIXProcess_md.c
+++ b/src/solaris/native/java/lang/UNIXProcess_md.c
@@ -479,6 +479,37 @@ closeSafely(int fd)
         close(fd);
 }
 
+/*
+ * Reads nbyte bytes from file descriptor fd into buf,
+ * The read operation is retried in case of EINTR or partial reads.
+ *
+ * Returns number of bytes read (normally nbyte, but may be less in
+ * case of EOF).  In case of read errors, returns -1 and sets errno.
+ */
+static ssize_t
+readFully(int fd, void *buf, size_t nbyte)
+{
+    ssize_t remaining = nbyte;
+    for (;;) {
+        ssize_t n = read(fd, buf, remaining);
+        if (n == 0) {
+            return nbyte - remaining;
+        } else if (n > 0) {
+            remaining -= n;
+            if (remaining <= 0)
+                return nbyte;
+            /* We were interrupted in the middle of reading the bytes.
+             * Unlikely, but possible. */
+            buf = (void *) (((char *)buf) + n);
+        } else if (errno == EINTR) {
+            /* Strange signals like SIGJVM1 are possible at any time.
+             * See http://www.dreamsongs.com/WorseIsBetter.html */
+        } else {
+            return -1;
+        }
+    }
+}
+
 #ifndef __solaris__
 #undef fork1
 #define fork1() fork()
@@ -606,9 +637,15 @@ Java_java_lang_UNIXProcess_forkAndExec(J
     /* parent process */
 
     close(fail[1]); fail[1] = -1; /* See: WhyCantJohnnyExec */
-    if (read(fail[0], &errnum, sizeof(errnum)) != 0) {
+
+    switch (readFully(fail[0], &errnum, sizeof(errnum))) {
+    case 0: break; /* Exec succeeded */
+    case sizeof(errnum):
         waitpid(resultPid, NULL, 0);
         throwIOException(env, errnum, "Exec failed");
+        goto Catch;
+    default:
+        throwIOException(env, errno, "Read failed");
         goto Catch;
     }
Posted Date : 2008-03-05 07:52:00.0

I am embarrassed to admit that I fell into the canonical "Worse" trap described by
the famous paper:

http://www.dreamsongs.com/WorseIsBetter.html

I am a "Right Thing" kind of guy myself, 
even more so after thinking about this bug.
Posted Date : 2008-03-05 23:22:56.0

Submitter has verified the fix.
Posted Date : 2008-03-06 02:53:17.0

It appears that this bug is a regression introduced by the fix for

4052517: (process) Runtime.exec won't execute programs belonging to other groups on Unix
4811767: (process) Runtime.exec should throw IOException when workdir does not exist (Unix)
5033302: (process) Can't execute Solaris NFS programs with uid>64k on Linux-amd64

in mustang-b81
Posted Date : 2008-03-14 18:47:15.0
Comments
  
  Include a link with my name & email   

Submitted On 25-MAR-2008
BharatD
I am seeing this issue for this version of java
---------
java version "1.5.0_10"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_10-b03)
Java HotSpot(TM) Server VM (build 1.5.0_10-b03, mixed mode)
-------

Has it been fixed for java 1.5 as well?



PLEASE NOTE: JDK6 is formerly known as Project Mustang