|
Quick Lists
|
|
Bug ID:
|
5014723
|
|
Votes
|
0
|
|
Synopsis
|
java/util/concurrent regression tests fail sporadically on multiprocessors
|
|
Category
|
hotspot:compiler2
|
|
Reported Against
|
tiger-beta2
|
|
Release Fixed
|
|
|
State
|
6-Fix Understood,
bug
|
|
Priority:
|
4-Low
|
|
Related Bugs
|
5031862
,
6666698
|
|
Submit Date
|
17-MAR-2004
|
|
Description
|
These JSR-166 tests (part of the J2SE workspace)
test/java/util/concurrent/locks/ReentrantLock/TimeoutLockLoops.java
test/java/util/concurrent/BlockingQueue/CancelledProducerConsumerLoops.java
are known to fail on machines with multiple CPUs, both
solaris-sparc and linux-amd64. It has never been seen to fail on
uniprocessor machines (tested on linux-i586, windows-i586).
The failures are seen on Tiger b43.
The failures are not easy to reproduce. I see a failure about 1% of the
time for both tests on both solaris-sparc and linux-amd64.
My recipe for reproducing it is to run an infinite loop till a failure
occurs, eg. for linux-amd64 I did:
( xxxxx@xxxxx ) java/util/concurrent/locks/ReentrantLock $ perl -e 'for ($i = 0;; $i++) { print "$i\n"; system("/u/martin/jct-tools/2.1.6/solaris/bin/jtreg -jdk:/u/martin/ws/tiger/build/linux-amd64/j2sdk-image -automatic TimeoutLockLoops.java"); exit 1 if $? != 0; }'
which failed after 98 iterations.
Here is an extract from a failing TimeoutLockLoops.jtr file:
----------System.out:(6/185)----------
Threads: 1 0.038286s run time
Threads: 2 0.039877s run time
Threads: 3 0.0399s run time
Threads: 5 0.039472s run time
Threads: 8 0.059474s run time
Threads: 12 0.079438s run time
----------System.err:(0/0)----------
result: Failed. Execution failed: Program `/u/martin/ws/tiger/build/linux-amd64/j2sdk-image/bin/java' interrupted! (timed out?)
test result: Failed. Execution failed: Program `/u/martin/ws/tiger/build/linux-amd64/j2sdk-image/bin/java' interrupted! (timed out?)
Here is an extract from a failing CancelledProducerConsumerLoops.jtr file:
----------System.out:(14/490)----------
Pairs:1
ArrayBlockingQueue 3.0553042s run time
LinkedBlockingQueue 2.2291932s run time
SynchronousQueue 2.485516s run time
Pairs:2
ArrayBlockingQueue 4.0495974s run time
LinkedBlockingQueue 3.4787056s run time
SynchronousQueue 3.558907s run time
Pairs:3
ArrayBlockingQueue 4.8086858s run time
LinkedBlockingQueue 3.1188338s run time
SynchronousQueue 3.3789728s run time
Pairs:5
ArrayBlockingQueue 3.3516484s run time
----------System.err:(0/0)----------
result: Failed. Execution failed: Program `/u/martin/ws/tiger/build/solaris-sparc/j2sdk-image/bin/java' interrupted! (timed out?)
test result: Failed. Execution failed: Program `/u/martin/ws/tiger/build/solaris-sparc/j2sdk-image/bin/java' interrupted! (timed out?)
I notice there are lingering processes after the test has stopped
running, which must be killed manually.
martin 15329 0.0 2.1 1344824 21604 pts/2 S 15:17 0:00 /u/martin/ws/tiger/build/linux-amd64/j2sdk-image/bin/java -Dtest.src=/export/martin/ws/canWrite/test/java/util/concurrent/BlockingQueue -Dtest.classes=/export/martin/ws/canWrite/test/java/util/concurrent/BlockingQueue/JTwork/classes/java/util/concurrent/BlockingQueue com.sun.javatest.regtest.MainWrapper /export/martin/ws/canWrite/test/java/util/concurrent/BlockingQueue/JTwork/classes/java/util/concurrent/BlockingQueue/CancelledProducerConsumerLoops.jta
xxxxx@xxxxx 2004-03-16
|
|
Work Around
|
N/A
|
|
Evaluation
|
xxxxx@xxxxx 2004-04-09
This is what I found. I got 2 kinds of hangs.
One when it hangs without any message on amd64 and I could
not connect gdb to it. The second with unreachable safepoint:
# SafepointSynchronize::begin: Fatal error:
# SafepointSynchronize::begin: Timed out while attempting to reach a safepoint.
# SafepointSynchronize::begin: Threads which did not reach the safepoint:
I reproduced the last hang on solaris_x86 (I know dbx better and I was able
to debug my binaries).
The loop "do {} while (n-- > 0);" in ReentrantLockLoop::run()
is countable even if the limit is "ITERS = Integer.MAX_VALUE".
We don't put safepoints in countable loops assuming it will finish
in reasonable time.
The scenario is simple: one thread acquired 'lock' (lock.tryLock() == true)
and now processing the loop body. Other threads are in barrier.await() or
waiting lock.tryLock(). At this time VM got a request for an operation and
forced all threads to go to safepoint state. All threads which are in
barrier.await() and lock.tryLock() go to safepoint (I have frames traces
which confirms this). The only thread left is one which processed the loop body.
And now the 'lock' can be easy acquired by this thread
since all other threads are in safepoint state and don't look for 'lock'.
The are only 2 places in a compiled method where we are placing
safepoints: before 'return' and at the end of uncountable loops or
loops where at least one call always executed (assuming we hit
the safepoint on the return from the call).
As I pointed above we don't have safepoint at the end of the loop.
Due to optimization the methods compute1() and compute2(x) are fully
inlined so we don't have safepoints for their 'return's.
The methods lock.tryLock() and lock.unlock() are inlined partially:
the code still has some nested calls (calls in tryLock() and unlock()).
But the execution only of the inlined parts of lock.tryLock() and lock.unlock()
is enough to acquire the 'lock' since all other threads are in safepoint state.
I executed the compiled code of ReentrantLockLoop::run() by steps in dbx
in this state and saw that we indeed never calls any functions.
This is why we avoid any safepoints until the end of iterations
and it is relatively long time (2**31-1 iterations).
On perf-lx2 (2x1.4GHz P3):
Threads: 18 0.225942996s run time
Threads: 27 397.664724057s run time
Exception in thread "main" java.lang.Error: Some thread completed instead of timing out
at TimeoutLockLoops$ReentrantLockLoop.test(TimeoutLockLoops.java:69)
at TimeoutLockLoops.main(TimeoutLockLoops.java:32)
The compiler should somehow figure out that a countable loop may take
long time to execute and place safepoint into it.
The fast solution (for 1.5.0 beta2) is to add a new flag to force
safepoint in countable loops. It should be 'false' by default since
the performance can be hit very hard. But we will get the workaround
for customers who get this problem.
And we can use the flag for these failed (timeout) test cases.
The possible long solution (for next release) is the implementation of
"strip mining" when you transform a countable loop into two nested
loops and place safepoint into external loop.
|
|
Comments
|
PLEASE NOTE: JDK6 is formerly known as Project Mustang
|
|
|
 |