According to the description, the customer is looking for the stack trace
on the standard output. The printStackTrace() method prints the stack trace
on standard error, not standard output. However, it does appear that the
log file does capture standard error, because the traceback from the second
call to printStackTrace() does come out (the description implies that it
doesn't). He should modify his program to write his extra debugging
output to System.err instead of System.out, and capture stderr and stdout
separately when running the program. Output combined from the two streams
may not be interleaved in the same order in which it was written.
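For illustration, a minimal sketch of that suggestion (the class and the failing statement are made up, not taken from the customer's program):

    public class StreamDemo {
        public static void main(String[] args) {
            try {
                String s = null;
                s.length();                        // provoke a NullPointerException
            } catch (NullPointerException e) {
                // printStackTrace() writes to System.err by default, so keep the
                // extra debugging output on the same stream to preserve ordering:
                System.err.println("extra debugging output");
                e.printStackTrace();
                // Alternative: direct the trace to standard output instead.
                // e.printStackTrace(System.out);
            }
        }
    }

Run this way, both lines appear on stderr and can be captured separately from stdout, for example by redirecting the two streams to different files.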
His invocation of the "printstacktrace" script does not do what he wants.
The execution will run concurrently with his program, so the thread he
is interested in could have continued execution far beyond his call
to Runtime.exec(). He needs to add a call to MyProcess.waitFor()
after the call to Runtime.getRuntime().exec(). However, this will
show the location in the exception handling code, not where the original
exception occurred. His analysis of where the NullPointerException
occurred, based on the thread dumps, is incorrect: the thread dumps show
where the program execution had advanced to by the time the SIGQUIT was
delivered.
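A minimal sketch of the waitFor() suggestion (the command here is only a placeholder for his "printstacktrace" script; the variable name follows his MyProcess):

    public class ExecDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder command; in his case this would be the "printstacktrace" script.
            Process MyProcess = Runtime.getRuntime().exec(
                new String[] { "/bin/sh", "-c", "echo placeholder output" });
            // Block until the child process has finished, so the current thread
            // does not run ahead of the script that is inspecting it.
            int exit = MyProcess.waitFor();
            System.err.println("script exited with status " + exit);
        }
    }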
It does seem to be a bug that the printStackTrace() call on the original
exception object does not print the stack trace. This does work correctly
on a small test case I have written. Without more information on how to
reproduce the bug I cannot evaluate it any further.
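The small test case mentioned above is not attached; the following sketch shows the kind of program meant (names are illustrative):

    public class SavedTraceDemo {
        static Throwable saved;

        static void boom() {
            throw new NullPointerException("original failure");
        }

        public static void main(String[] args) {
            try {
                boom();
            } catch (NullPointerException e) {
                saved = e;                 // keep the original exception object
            }
            // Printing later still shows boom() as the origin of the exception,
            // which is the behavior the customer reports as missing.
            saved.printStackTrace();
        }
    }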
Initially, I suspected it to be a compiler issue, but after inspecting the two cores at the given location using SA, I could find very few compiled frames,
and the method CreateMapArguments(), which is seeing the data loss, wasn't compiled in either of the cores.
Right now I suspect the generated interpreter code. To make sure it's an interpreter issue and not a compiler problem, I would suggest a run with
java -Xcomp. If that run also fails, do another run with compilation of the above three methods disabled.
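One way to disable compilation of individual methods, assuming the .hotspot_compiler mechanism is available in the JVM build under test, is to place a file named .hotspot_compiler in the working directory with one exclude directive per method; the class name below is a placeholder, since only CreateMapArguments() is named in this report:

    exclude Some/Package/SomeClass CreateMapArguments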
A new class file was developed as suggested and tested. In summary,
it is getting worse :-(
The class file is under /dumps/CSVLogReader.class, if needed.
There have been several crashes due to NullPointerException.
For a sample, look in the file /t3-5/gatherer/g5/Log/Gatherer_stdout.log
towards the end of the logfile.
There was also another crash with little information. Pertinent details
are in /t3-1/gatherer/g29/Log/Gatherer_stdout.log and its corresponding
core file is /t3-1/gatherer/g29/Gatherer/core.
There was also a Hashtable hang bug; its details are in
Using -Xcomp is NOT an acceptable workaround for several reasons.
- We do not ask a customer to run with undocumented flags in production.
- We very well know that using it is detrimental to performance, and
that's probably the reason we don't document it.
- This application's (the ISV's) main competitive feature is performance,
which is the main reason I (MDE) am working with them now, but in the
middle of these bugs... let me stop with this.
The system is on SWAN. When I left I was told that there's some
maintenance work going on with the building router, but that should
not affect this system. However, I was not able to get to my mail
in the sfbay domain, but I can access that E10K from here now.
So, the connectivity may be intermittent but please don't give up.
> Date: Wed, 20 Nov 2002 19:41:53 -0500 (GMT)
> From: ###@###.###
> Synopsis: NullPointerException just before data loss
> ** Nov 20 2002 5:56AM R,Anupam **
> Event: The test with -Xcomp doesn't reproduce the problem. So most probably it's the generated interpreter code that is the culprit. I am currently looking at the generated codelets for bytecodes. My suspicion right now is on the iconst_m1 opcode, which will put a -1 (0xffffffffffffffff) on the operand stack.
> Looking in CreateMapArguments(), the place where this opcode is coming from is:
> boolean Found = false;
> Can you try a run after changing Found to an "int" data type and using the values 1 and 2 for false and true, respectively? This will avoid iconst_m1.
> But since -Xcomp doesn't reproduce the problem, it can be used as a workaround for the time being.
> I will continue to need the machine and I haven't made any modification to the set up on E10K today, as requested.
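A minimal sketch of the workaround suggested in the comment above (everything except the Found flag and the 1-for-false, 2-for-true convention is made up for illustration):

    public class FoundWorkaround {
        public static void main(String[] args) {
            String[] data = { "a", "b", "target" };
            int Found = 1;                    // was: boolean Found = false;  (1 = false, 2 = true)
            for (int i = 0; i < data.length; i++) {
                if ("target".equals(data[i])) {
                    Found = 2;                // was: Found = true;
                }
            }
            if (Found == 2) {                 // was: if (Found)
                System.out.println("found");
            }
        }
    }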
Looked at every word in the failing method's stack frame using SA. I would expect the local reference which is getting corrupted to be present there, but didn't find any oop of type Util.MapArguments. I did see a null value in the region where the other local oops were present. Knowing how interpreter frames are laid out would help get the exact offset for this local. I think an earlier bytecode execution corrupted (nulled) this location, so the next iload_2, loading from that location, puts null on the operand stack.
The test doesn't fail when compilation of CreateMapArguments() is disabled, and it doesn't fail in -Xcomp mode either.
Looking at the TraceDeoptimization output, CreateMapArguments() is compiled and deoptimized twice:
1. The first is a normal compile, deoptimized due to a failed optimized checkcast. The deopt here looks proper and execution proceeds normally.
2. The second is an OSR compile, due to the for loop in CreateMapArguments(), and is deoptimized due to "unresolved class: Util/KDSServer". Execution after this deopt gives a NullPointerException.
This points to a problem in the setting up of interpreted frames after deopt.
Looking at the .class file and the PrintOpto output, it looks like an OSR-compile-only problem, and deopt may not be the issue. Although the deopt packing showed map as NULL, the deopt itself happened inside the catch block, and the catch was executed because map was already null. PrintOpto shows that, at calls to parse*() methods in the pseudo-compiled code, L2=#NULL (that's the place where local #2, map, will be kept), whereas for normal compiles (where there is no data corruption) L2 has good values. The compiler has assumed _L2 (map) to be dead in a range where it should have been alive.
Attached file BadNullAssert.java shows a test case for a possible root cause of this bug.
The problem is subtle: An OSR method is compiled using a model of typestates which are
derived from flow analysis over the bytecodes. This flow analysis is necessarily incomplete
if the interpreter has not yet exercised all paths in the code, leaving some classes still
unloaded. In effect the CFG for the method grows over time, as the interpreter exercises
new paths and loads new classes. Exercising a new path must invalidate pre-existing OSR
methods, since they may have compiled into them assumptions about typestates that must
change as new CFG edges come into being.
The problem can manifest (in the test case above, for example) when an unloaded class is
the subject of a checkcast operation. The flow analysis that precedes an OSR compilation
will simply assert that any value cast to an unloaded class must be null, and the
compilation proceeds on that assumption. Later on, if the class loads, the interpreter
can checkcast non-null values successfully, leading to typestates that were previously
proved impossible. The OSR method can be invoked on a typestate created by the interpreter
containing non-null values, and yet the OSR code will assume that they are null, based
on the out-of-date type analysis.
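For illustration, a sketch of the pattern just described (this is not the attached BadNullAssert.java; class and method names are made up):

    public class OsrCheckcastSketch {
        static Object produce(int i) {
            // While this branch keeps returning null, class Lazy is never loaded,
            // so the flow analysis for an OSR compile can "prove" the cast result is null.
            return (i < 100000) ? null : new Lazy();
        }

        public static void main(String[] args) {
            for (int i = 0; i < 200000; i++) {   // long-running loop: an OSR candidate
                Object o = produce(i);
                Lazy l = (Lazy) o;               // checkcast against an initially unloaded class
                if (l != null) {
                    l.touch();                   // under the stale assumption, a spurious NPE
                }
            }
        }
    }

    class Lazy {
        void touch() { }
    }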
The solution is to have OSR methods register dependencies on unloaded classes, and get flushed
if the classes ever load. This is not an issue with regular (non-OSR) methods, since
a normal method creates all its own typestates, from the method's entry point forward.
There is no way for the interpreter to dump surprising typestates into a non-OSR method.
A binary which fixes this bug is here: /net/lukasiewicz/export/lukasiewicz1/4761344
There are product, optimized, and fastdebug flavors. It is based on the current 1.4.2 baseline.
To gather more information, testing should use the following extra flags:
-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=hotspot.log
The file "hotspot.log" will be written with diagnostic information.
The name can be adjusted to any pathname.
The libjvm.so binaries containing the fix are under this directory:
The 32-bit VM builds are:
fastdebug/libjvm_g.so  - asserts enabled, compiled optimized, for java_g
optimized/libjvm.so    - debugging options compiled in
product/libjvm.so      - product build
The corresponding 64-bit VM builds are:
All of these were built from the current 1.4.2 baseline, but they
can be run when inserted into the 1.4.1 JRE.
To verify installation, start a VM with -Xinternalversion, and make
sure the version string includes "jrose in mantis".
If these binaries do not fix the problem, please re-run with the
previously suggested three options including +LogCompilation.
After some intensive analysis, we are more confident the root problem is the compiler
erroneously deciding that a particular class is unloaded, due to the fact that it is
defined in a delegate class loader and not directly visible in the current class loader.
We believe we can fix this cleanly and reliably by using the class loader constraints
recorded by the VM, to correctly report that the class in question is in fact loaded
in the VM, even though it has not yet been fetched from the local class loader.
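A minimal sketch of the delegation situation being described (the loader setup and the class chosen are only illustrative):

    import java.net.URL;
    import java.net.URLClassLoader;

    public class DelegationDemo {
        public static void main(String[] args) throws Exception {
            // A child loader with no classes of its own: every request is
            // satisfied by delegating to its parent.
            ClassLoader child = new URLClassLoader(
                new URL[0], DelegationDemo.class.getClassLoader());
            Class c = Class.forName("java.util.Hashtable", true, child);
            // The class is fully loaded in the VM, but its defining loader is the
            // parent (here the bootstrap loader, reported as null), not "child".
            System.out.println("defining loader: " + c.getClassLoader());
        }
    }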
I am working on a VM that includes this adjustment. Meanwhile, run with logging
turned on, and the log file set to a non-existent file, such as "/no/such/file".
This will cause the VM to choose a unique file in /tmp/, avoiding the problem
with multiple VMs overwriting a single log.
There is an updated VM which fixes the problems mentioned above.
Please test with it.
The 64-bit VM builds are:
Recent JDC comments report that there is some variation of
this bug active in 1.4.2. The particular bug fixed under
this report had to do with the compiler getting confused
by aliasing of classes through different class loaders,
causing it to "know" that a given class was not yet loaded
and therefore all references "must" be null, leading to
embarrassing spurious NPEs.
On top of this problem, up until 1.5 the compiler optimizes
the treatment of NPEs by using a preallocated NPE object,
rather than creating it every time. The preallocated NPE
object works great, except that it lacks a backtrace.
A backtrace-free NPE makes it hard to diagnose any NPE bug,
whether it comes from user error (usually) or a JVM bug (rarely).
We have opened a separate bug 4292742 for the missing backtrace,
and fixed it in 1.5beta2.
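To see why the preallocated exception has no useful backtrace, here is a plain-Java sketch of the idea (it illustrates the semantics, not the compiler's internal mechanism):

    public class PreallocatedNpeDemo {
        // The stack trace is captured when the object is constructed (here, during
        // class initialization), not at the point where it is later thrown.
        static final NullPointerException SHARED = new NullPointerException();

        static void fail() {
            throw SHARED;            // every throw reuses the same object and trace
        }

        public static void main(String[] args) {
            try {
                fail();
            } catch (NullPointerException e) {
                e.printStackTrace(); // shows the class initializer, not fail()
            }
        }
    }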
This backtrace problem is probably bothering people on 1.4.2 release.
It is probably making it difficult to debug application errors in
compiled code, forcing people to run with -Xint or -client to debug.
If this is true for you, I suggest transferring your JDC vote to bug
4292742, so we can consider backporting a version of the 1.5 fix.
On the other hand, NPEs that are truly spurious, and are
known to be due to JVM errors, should probably be reported as new bugs.
However, if they are clearly cases of the complex syndrome described
in this bug, we would appreciate an explicit note to this effect,
so we can re-evaluate the fix to the class loading logic.