Java Solaris Communities Sun Store Join SDN My Profile Why Join?
 
Bug Database
Bug Detail
Quick Lists
Top 25 Bugs
Top 25 RFE's
Recently Closed Bugs
Printable Page Printable Page


Bug Database
Bug ID: 5014723
Votes 0
Synopsis java/util/concurrent regression tests fail sporadically on multiprocessors
Category hotspot:compiler2
Reported Against tiger-beta2
Release Fixed
State 6-Fix Understood, bug
Priority: 4-Low
Related Bugs 5031862 , 6666698
Submit Date 17-MAR-2004
Description
These JSR-166 tests (part of the J2SE workspace)

test/java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java
test/java/util/concurrent/BlockingQueue/CancelledProducerConsumerLoops.java

are known to fail on machines with multiple CPUs, both
solaris-sparc and linux-amd64.  It has never been seen to fail on
uniprocessor machines (tested on linux-i586, windows-i586). 
 
The failures are seen on Tiger b43.

The failures are not easy to reproduce.  I see a failure about 1% of the
time for both tests on both solaris-sparc and linux-amd64.

My recipe for reproducing it is to run an infinite loop till a failure
occurs, eg. for linux-amd64 I did:

(  xxxxx@xxxxx  ) java/util/concurrent/locks/ReentrantLock $ perl -e 'for ($i = 0;; $i++) { print "$i\n"; system("/u/martin/jct-tools/2.1.6/solaris/bin/jtreg -jdk:/u/martin/ws/tiger/build/linux-amd64/j2sdk-image -automatic TimeoutLockLoops.java"); exit 1 if $? != 0; }'

which failed after 98 iterations.

Here is an extract from a failing TimeoutLockLoops.jtr file:

----------System.out:(6/185)----------
Threads: 1	 0.038286s run time
Threads: 2	 0.039877s run time
Threads: 3	 0.0399s run time
Threads: 5	 0.039472s run time
Threads: 8	 0.059474s run time
Threads: 12	 0.079438s run time
----------System.err:(0/0)----------
result: Failed. Execution failed: Program `/u/martin/ws/tiger/build/linux-amd64/j2sdk-image/bin/java' interrupted! (timed out?)


test result: Failed. Execution failed: Program `/u/martin/ws/tiger/build/linux-amd64/j2sdk-image/bin/java' interrupted! (timed out?)

Here is an extract from a failing CancelledProducerConsumerLoops.jtr file:

----------System.out:(14/490)----------
Pairs:1
ArrayBlockingQueue      	 3.0553042s run time
LinkedBlockingQueue     	 2.2291932s run time
SynchronousQueue        	 2.485516s run time
Pairs:2
ArrayBlockingQueue      	 4.0495974s run time
LinkedBlockingQueue     	 3.4787056s run time
SynchronousQueue        	 3.558907s run time
Pairs:3
ArrayBlockingQueue      	 4.8086858s run time
LinkedBlockingQueue     	 3.1188338s run time
SynchronousQueue        	 3.3789728s run time
Pairs:5
ArrayBlockingQueue      	 3.3516484s run time
----------System.err:(0/0)----------
result: Failed. Execution failed: Program `/u/martin/ws/tiger/build/solaris-sparc/j2sdk-image/bin/java' interrupted! (timed out?)


test result: Failed. Execution failed: Program `/u/martin/ws/tiger/build/solaris-sparc/j2sdk-image/bin/java' interrupted! (timed out?)

I notice there are lingering processes after the test has stopped
running, which must be killed manually.

martin   15329  0.0  2.1 1344824 21604 pts/2 S    15:17   0:00 /u/martin/ws/tiger/build/linux-amd64/j2sdk-image/bin/java -Dtest.src=/export/martin/ws/canWrite/test/java/util/concurrent/BlockingQueue -Dtest.classes=/export/martin/ws/canWrite/test/java/util/concurrent/BlockingQueue/JTwork/classes/java/util/concurrent/BlockingQueue com.sun.javatest.regtest.MainWrapper /export/martin/ws/canWrite/test/java/util/concurrent/BlockingQueue/JTwork/classes/java/util/concurrent/BlockingQueue/CancelledProducerConsumerLoops.jta


  xxxxx@xxxxx   2004-03-16
Work Around
N/A
Evaluation
  xxxxx@xxxxx   2004-04-09

This is what I found. I got 2 kinds of hangs.
One when it hangs without any message on amd64 and I could 
not connect gdb to it. The second with unreachable safepoint:

# SafepointSynchronize::begin: Fatal error:
# SafepointSynchronize::begin: Timed out while attempting to reach a safepoint.
# SafepointSynchronize::begin: Threads which did not reach the safepoint:

I reproduced the last hang on solaris_x86 (I know dbx better and I was able
to debug my binaries).

The loop "do {} while (n-- > 0);" in ReentrantLockLoop::run()
is countable even if the limit is "ITERS = Integer.MAX_VALUE".
We don't put safepoints in countable loops assuming it will finish
in reasonable time.

The scenario is simple: one thread acquired 'lock' (lock.tryLock() == true)
and now processing the loop body. Other threads are in barrier.await() or
waiting lock.tryLock(). At this time VM got a request for an operation and 
forced all threads to go to safepoint state. All threads which are in 
barrier.await() and lock.tryLock() go to safepoint (I have frames traces
which confirms this). The only thread left is one which processed the loop body.
And now the 'lock' can be easy acquired by this thread
since all other threads are in safepoint state and don't look for 'lock'. 

The are only 2 places in a compiled method where we are placing
safepoints: before 'return' and at the end of uncountable loops or
loops where at least one call always executed (assuming we hit
the safepoint on the return from the call).

As I pointed above we don't have safepoint at the end of the loop.
Due to optimization the methods compute1() and compute2(x) are fully 
inlined so we don't have safepoints for their 'return's. 
The methods lock.tryLock() and lock.unlock() are inlined partially:
the code still has some nested calls (calls in tryLock() and unlock()).
But the execution only of the inlined parts of lock.tryLock() and lock.unlock()
is enough to acquire the 'lock' since all other threads are in safepoint state.

I executed the compiled code of ReentrantLockLoop::run() by steps in dbx
in this state and saw that we indeed never calls any functions.
This is why we avoid any safepoints until the end of iterations 
and it is relatively long time (2**31-1 iterations). 
On perf-lx2 (2x1.4GHz P3):

Threads: 18      0.225942996s run time
Threads: 27      397.664724057s run time
Exception in thread "main" java.lang.Error: Some thread completed instead of timing out
        at TimeoutLockLoops$ReentrantLockLoop.test(TimeoutLockLoops.java:69)
        at TimeoutLockLoops.main(TimeoutLockLoops.java:32)

The compiler should somehow figure out that a countable loop may take 
long time to execute and place safepoint into it. 

The fast solution (for 1.5.0 beta2) is to add a new flag to force
safepoint in countable loops. It should be 'false' by default since
the performance can be hit very hard. But we will get the workaround
for customers who get this problem. 
And we can use the flag for these failed (timeout) test cases.

The possible long solution (for next release) is the implementation of
"strip mining" when you transform a countable loop into two nested
loops and place safepoint into external loop.
Comments
  
  Include a link with my name & email   


PLEASE NOTE: JDK6 is formerly known as Project Mustang