United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
Bug ID: 6671051 (process) Runtime.exec() hangs if signalled during fork/exec
6671051 : (process) Runtime.exec() hangs if signalled during fork/exec

Details
Type:
Bug
Submit Date:
2008-03-04
Status:
Closed
Updated Date:
2011-05-18
Project Name:
JDK
Resolved Date:
2011-05-18
Component:
core-libs
OS:
solaris_10
Sub-Component:
java.lang
CPU:
generic
Priority:
P2
Resolution:
Fixed
Affected Versions:
6u3
Fixed Versions:
7

Related Reports
Backport:
Backport:
Relates:
Relates:
Relates:

Sub Tasks

Description
We are seeing threads hung in java.lang.Runtime.exec() on Solaris 10. The child process is created, but the Runtime.exec() call never returns.

In this example the calling thread is 103(0x67) and the child pid is 14159(0x374f) - I didn't capture the output that shows that.

Here is the backtrace (from jstack). 

"CacheManagerTaskExecutor-15" prio=10 tid=0x0840d400 nid=0x67 sleeping[0xb48a7000..0xb48a7b70]
   java.lang.Thread.State: RUNNABLE
        at java.lang.UNIXProcess.forkAndExec(Native Method)
        at java.lang.UNIXProcess.<init>(UNIXProcess.java:53)
        at java.lang.ProcessImpl.start(ProcessImpl.java:65)
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:452)
        at java.lang.Runtime.exec(Runtime.java:593)
        at java.lang.Runtime.exec(Runtime.java:431)
        at java.lang.Runtime.exec(Runtime.java:328)
        at com.sun.hss.domain.util.misc.ExpectCommandExecutor.execute(ExpectCommandExecutor.java:48)
        - locked <0xbc1e72d0> (a com.sun.hss.domain.util.misc.ExpectCommandExecutor)
        at com.sun.hss.domain.util.ipmi.IPMICommand.exec(IPMICommand.java:168)
        at com.sun.hss.domain.util.ipmi.IPMICommand.exec(IPMICommand.java:145)
        at com.sun.hss.domain.driver.generic.GenericServerMXBeanHelper.execIPMICommand(GenericServerMXBeanHelper.java:406)
        at com.sun.hss.domain.util.ipmi.AbstractIPMICommandHelper.getFruMap(AbstractIPMICommandHelper.java:279)
        at com.sun.hss.domain.driver.v20z.V20zServerMXBeanHelper.getSerialNumber(V20zServerMXBeanHelper.java:211)
        at com.sun.hss.domain.driver.generic.AbstractServerMXBean.getSerialNumber(AbstractServerMXBean.java:87)
        at sun.reflect.GeneratedMethodAccessor164.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.sun.jmx.mbeanserver.ConvertingMethod.invokeWithOpenReturn(ConvertingMethod.java:167)
        at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:96)
        at com.sun.jmx.mbeanserver.MXBeanIntrospector.invokeM2(MXBeanIntrospector.java:33)
        at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:208)
        at com.sun.jmx.mbeanserver.PerInterface.getAttribute(PerInterface.java:65)
        at com.sun.jmx.mbeanserver.MBeanSupport.getAttribute(MBeanSupport.java:216)
        at javax.management.StandardMBean.getAttribute(StandardMBean.java:358)
        at com.sun.hss.domain.AbstractResourceMXBean.getAttribute(AbstractResourceMXBean.java:209)
        at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:666)
        at com.sun.jmx.mbeanserver.JmxMBeanServer.getAttribute(JmxMBeanServer.java:638)
        at com.sun.jdmk.interceptor.DefaultMBeanServerInterceptor.getAttribute(DefaultMBeanServerInterceptor.java:178)
        at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178)
        at com.sun.hss.domain.ConcurrentMBeanServerInterceptor.getAttribute(ConcurrentMBeanServerInterceptor.java:837)
        at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178)
        at com.sun.hss.domain.ForwardingMBeanServerInterceptor.getAttribute(ForwardingMBeanServerInterceptor.java:178)
        at com.sun.cacao.agent.DispatchInterceptor.getAttribute(DispatchInterceptor.java:506)
        at com.sun.cacao.agent.auth.impl.AccessControlInterceptor.getAttribute(AccessControlInterceptor.java:310)
        at com.sun.jdmk.JdmkMBeanServerImpl.getAttribute(JdmkMBeanServerImpl.java:640)
        at com.sun.cacao.cached.impl.server.ServerCachedMBeanProxy.rereadAttribute(ServerCachedMBeanProxy.java:326)
        at com.sun.cacao.cached.impl.server.ServerCachedMBeanProxy.refreshCachedAttributes(ServerCachedMBeanProxy.java:295)
        at com.sun.cacao.cached.impl.server.RefreshAttributeTask.execute(RefreshAttributeTask.java:60)
        at com.sun.cacao.cached.impl.server.TaskPool$TaskExecutor.run(TaskPool.java:334)


pstack shows the following for the thread (note that the 374f parameter corresponds to the child in question):
-----------------  lwp# 103 / thread# 103  --------------------
 fef30ee7 waitid   (0, 374f, b48a6fc0, 3)
 fef25b52 waitpid  (374f, 0, 0) + 75
 fe488242 Java_java_lang_UNIXProcess_forkAndExec (840d4ec, b48a7170, b48a716c, b48a7168, 7, 0) + 6ca
 fb40a598 ???????? (0, fb408219, bc1e7260, bc1e7250, bc1e7240, 0)
 fb402de9 ???????? (0, 0, 0, 0, 7, bc1e71f0)
 fb7f62b0 ???????? ()

This appears to be stuck in the following code in Java_java_lang_UNIXProcess_forkAndExec():
    close(fail[1]); fail[1] = -1; /* See: WhyCantJohnnyExec */
    if (read(fail[0], &errnum, sizeof(errnum)) != 0) {
        waitpid(resultPid, NULL, 0);
        throwIOException(env, errnum, "Exec failed");
        goto Catch;


Although this is highly reproducible in our lab stress tests, it was hard to produce when running truss. However, in this case I was able to capture it with truss -l -t read,waitid. Here is the truss output from the time of the failure (note the lwp id is 103, consistent with the above):
/103:         siginfo: SIGJVM1 pid=13520 uid=0 code=-1
/103:    0.0024 read(309, 0xB48A70B4, 4)                        Err#4 EINTR
/103:   waitid(P_PID, 14159, 0xB48A6FC0, WEXITED|WTRAPPED) (sleeping...)

It looks like there is a bug in the code above. It assumes that a non-zero return from the read system call represents a successful read of errnum, while in this case, the return value is actually -1. This causes the calling thread to wait for the child process to complete while, in our example, the child process has successfully exec'ed and is actually waiting to communicate with the hung thread upon its return.

                                    

Comments
EVALUATION

It appears that this bug is a regression introduced by the fix for

4052517: (process) Runtime.exec won't execute programs belonging to other groups on Unix
4811767: (process) Runtime.exec should throw IOException when workdir does not exist (Unix)
5033302: (process) Can't execute Solaris NFS programs with uid>64k on Linux-amd64

in mustang-b81
                                     
2008-03-14
EVALUATION

Submitter has verified the fix.
                                     
2008-03-06
EVALUATION

Here is my (untested) attempt to get the logic completely right.
Does programming always have to be so hard?

diff --git a/src/solaris/native/java/lang/UNIXProcess_md.c b/src/solaris/native/java/lang/UNIXProcess_md.c
--- a/src/solaris/native/java/lang/UNIXProcess_md.c
+++ b/src/solaris/native/java/lang/UNIXProcess_md.c
@@ -479,6 +479,37 @@ closeSafely(int fd)
         close(fd);
 }
 
+/*
+ * Reads nbyte bytes from file descriptor fd into buf,
+ * The read operation is retried in case of EINTR or partial reads.
+ *
+ * Returns number of bytes read (normally nbyte, but may be less in
+ * case of EOF).  In case of read errors, returns -1 and sets errno.
+ */
+static ssize_t
+readFully(int fd, void *buf, size_t nbyte)
+{
+    ssize_t remaining = nbyte;
+    for (;;) {
+        ssize_t n = read(fd, buf, remaining);
+        if (n == 0) {
+            return nbyte - remaining;
+        } else if (n > 0) {
+            remaining -= n;
+            if (remaining <= 0)
+                return nbyte;
+            /* We were interrupted in the middle of reading the bytes.
+             * Unlikely, but possible. */
+            buf = (void *) (((char *)buf) + n);
+        } else if (errno == EINTR) {
+            /* Strange signals like SIGJVM1 are possible at any time.
+             * See http://www.dreamsongs.com/WorseIsBetter.html */
+        } else {
+            return -1;
+        }
+    }
+}
+
 #ifndef __solaris__
 #undef fork1
 #define fork1() fork()
@@ -606,9 +637,15 @@ Java_java_lang_UNIXProcess_forkAndExec(J
     /* parent process */
 
     close(fail[1]); fail[1] = -1; /* See: WhyCantJohnnyExec */
-    if (read(fail[0], &errnum, sizeof(errnum)) != 0) {
+
+    switch (readFully(fail[0], &errnum, sizeof(errnum))) {
+    case 0: break; /* Exec succeeded */
+    case sizeof(errnum):
         waitpid(resultPid, NULL, 0);
         throwIOException(env, errnum, "Exec failed");
+        goto Catch;
+    default:
+        throwIOException(env, errno, "Read failed");
         goto Catch;
     }
                                     
2008-03-05
EVALUATION

I am embarrassed to admit that I fell into the canonical "Worse" trap described by
the famous paper:

http://www.dreamsongs.com/WorseIsBetter.html

I am a "Right Thing" kind of guy myself, 
even more so after thinking about this bug.
                                     
2008-03-05
EVALUATION

The read in question from a private pipe assumes either success or EOF,
but an EINTR is not impossible, and so should be handled by retrying the operation.
                                     
2008-03-04



Hardware and Software, Engineered to Work Together