United StatesChange Country, Oracle Worldwide Web Sites Communities I am a... I want to...
Bug ID: 5005861 native libraries for java.lang.Math should be non-strict
5005861 : native libraries for java.lang.Math should be non-strict

Details
Type:
Enhancement
Submit Date:
2004-03-01
Status:
Open
Updated Date:
2009-04-09
Project Name:
JDK
Resolved Date:
Component:
core-libs
OS:
windows_xp,windows_2000
Sub-Component:
java.lang
CPU:
x86
Priority:
P4
Resolution:
Unresolved
Affected Versions:
1.4.2,5.0,6
Targeted Versions:

Related Reports
Relates:
Relates:
Relates:
Relates:

Sub Tasks

Description
Name: rmT116609			Date: 03/01/2004


A DESCRIPTION OF THE REQUEST :
The transcedental functions in java.lang.Math, like trigonometric or
exponential methods, have dismal performance when compared to similar
functions from C/C++ libraries (or virtually any compiled language).
Java's performance is limited by the stringent requirements of IEEE
adherence, but this should impose only a modest overhead over other
languages. Investigating the SCSL sources, it's easy to see that the root of
all slowness lies in the fldlibm native library, which implements the
native methods from java.lang.StrictMath with an approach that compares
to emulated FP; that is, the libray doesn't exploit the FP opcodes of
FPUs, not even for the "kernel" of the calculation that FPUs could do.

For example, java.lang.StrictMath.tan() calls a JNI stub, that calls
s_tan.c:tan(), that calls k_tan.c:__kernel_tan().  These two C functions
implement the tangent with a complicated algorithm that never makes use
of the FPU's tangent opcode (e.g., FTAN in Intel chips). This doesn't make
the algorithm FPU-free, because the C compiler will generate the
architecture-specific opcodes for fundamental operations used in fldlibm,
like +,-,*,/; it's only the critical FTAN instruction that's avoided, but that's the
one instruction that could replace 90% of the C code -- and run one order
of magnitude faster than that code. In fact, debugging inside a C/C++
compiler's equivalent tan() library function, it's possible to see that FTAN
does all the hard work (with a few extra instructions to load arguments or
handle special cases like infinites).

This approach in fldlibm is probably the only way to implement "strict"
maths, so the results are bit-per-bit identical in all platforms; the
microcode in FPUs often use simpler algorithms than those in fldlibm,
so they may produce less precise results or not obey all requisites from
the Java spec, such as semi-monotonicity.

Unfortunately, all maths are strict in Java, because the transcedental
methods in java.lang.Math are all stubs that invoke the same method in
java.lang.StrictMath.  I can understand the tradeoff in StrictMath, but
in the standard ("loose") Math class, we should favor best performance.
Notice that most semantics dictated by the specification of the functions
(i.e., support for all special values) could still be implemented with
checks performed before invoking the FPU instruction.  It's possible
[I'm not a Math expert] that 1ulp precision and semi-monotonicity cannot
be implemented easily (in the general case (*)) around the FPU opcode;
but if this is a problem, the spec should be fixed; these requirements
clearly exceed the needs of loose maths.

In a similar fashion, the methods min() and max() should be alleviated
from the demands of supporting NaN and signed zeros; due to this,
Java's min()/max() are a lot slower than in most other languages (where
similar functions are defined trivially, as "x >= y ? x : y").  In fact, most Java developers completely ignore the overheads of min()/max() because such
methods are so common in all languages, that a new comer to Java would
hardly care to read the docs or the sources (what could be simpler than
min and max?). But min/max can be critical to some tight loops (e.g. DSP).
Once again, I think java.lang.StrictMath is okay (it's great to have a
max() that reports that +0 is bigger than -0, if I ever need this); but in
java.lang.Math, the spec should enable the trivial implementation.

(*) When the FPU can do calculations with extended precision, i.e. in the
Pentium: using 64-bit registers for floats or 80-bit registers for doubles,
it's highly likely that the results of transcedental opcodes will be exact at
least to the last bit used by the Java type (32 for floats, 64 for doubles).
This means that the current requisites of precision and semi-monotonicity
would certainly be satisfied, and only bit-per-bit reproducibility of results
might be lost, so this optimization could be performed without any spec
changes at least in architectures supporting extended-precision FP.
Even in platforms not supporting FP numbers larger than 64 bits, the
optimization could be done for floats at the very least.

JUSTIFICATION :
Many years have gone by since the introduction of StrictMath/strictfp,
so the people who really care about such wizardry as signed zeroes or
multiplatform-reproducible results are either using StrictMath explicitly,
or deserving any problem they could have with the relaxation of the loose
methods.  On the other hand, the people who don't need those advanced
features but need maximum FP performance currently have no choice
(except for custom math libs, and most would require custom JNI code).
If there is a risk that further relaxing of java.lang.Math could break some
apps (that *shouldn't* be using this class in the first place), this is certainly
a lesser problem than the severe hit currently imposed for those who want
to code scientific and engineering apps, or advanced 3D games, in Java.



EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Performance should bw similar to C/C++; for example, benchmarking tan()
shows a performance of 12ns per call (VC++ 2003,  PentiumIV-2,25GHz)
ACTUAL -
The Java version of this benchmark shows 148ns per call to tan() (either the
java.lang.Math or java.lang.StrictMath).  This is 12X slower (HotSpot 1.4.2 -server -Xbatch).  The same problem hapopens with several methods in java.lang.Math.

---------- BEGIN SOURCE ----------
Java benchmark:

public class Tan {
 static void benchLoose () {
  long t = System.currentTimeMillis();
  for (double i = 0; i < 10000000.0; ++i) Math.tan(i);
   System.out.println("Loose:  "+(System.currentTimeMillis()-t)/100+"ns/call");
 }
 static strictfp void benchStrict ()	{
  long t = System.currentTimeMillis();
  for (double i = 0; i < 10000000.0; ++i) StrictMath.tan(i);
   System.out.println("Strict: "+(System.currentTimeMillis()-t)/100+"ns/call");
 }
 public static void main (String[] args)	{
  while (true) {
   benchLoose();
   benchStrict();
  }
 }
}

C++ benchmark:

#include <cmath>
#include <cstdio>
#include <ctime>
void bench () {
 clock_t t = clock();
 for (double i = 0; i < 10000000.0; ++i) tan(i);
 printf("C++:  %fns/call\n", (clock() - t) / 100.0);
}
void main (char** args)	{
 while (true)
  bench();
}

Note: even though both are microbenchmarks and I test with full optimization,
I have analysed the generated code to be sure that tan() is being called.
---------- END SOURCE ----------
(Incident Review ID: 234064) 
======================================================================

                                    

Comments
CONVERTED DATA

BugTraq+ Release Management Values

COMMIT TO FIX:
mustang


                                     
2004-09-27
EVALUATION

Thank you for your interest in Java numerics.

A few preliminaries:

1. the language's default floating-point semantics vs. strictfp
floating-point semantics and the semantics of Math vs. StrictMath
libraries are separate issues that are not directly related

2. neither the original 1985 IEEE 754 floating-point standard nor the
draft revised standard covers the behavior of any "interesting" math
library functions other than square root.  The 754 revision committee
explicitly decided not to attempt to specify the behavior of "easy"
functions like log and exp.  However, the 754 revision has added
min/max operations that require "proper" treatment of signed zeros,
etc., similar to the java.lang.{Strict}Math specifications for that
operation.

One of the driving design goals of Java's original floating-point
semantics was bit-wise cross-platform reproducibility for both basic
arithmetic operations and math library results.  Specifying such a
result for the basic arithmetic operations is relatively easy because
of the ubiquity of 754 and because the "correct" answer is easy to
understand and feasible to achieve: assuming the inputs and exact,
return the floating-point number nearest to the exact mathematical
result.  While this notion of a "correctly rounded" function is as
applicable to {+, -, *, /} as to sin, cos, exp, etc., creating usably
fast correctly rounded implementation of the latter class of functions
remains something of a research problem.  Therefore, to achieve the
same level of reproducibility for math library functions, it is
necessary to use a particular algorithm.  Java (and Matlab) use the
fdlibm libraries for this purpose.  Starting in 1.3, the StrictMath
class required use of the fdlibm algorithms while corresponding
methods in the Math class were allowed to use any implementation that
meets the stated quality of implementation requirements, usually in
terms of accuracy and monotonicity.

The fdlibm algorithms satisfy the Math quality of implementation
requirements so the current *source code* which delegates Math calls
to StrictMath is correct.  However, how an optimizing jvm, like
HotSpot, implements the math library might bear little resemblance to
the source code.  For example, the sqrt method is commonly replaced by
the square root instruction on the CPU in question.  This
"intrinsification" optimization is done in the jvm and is not
represented in source code.

We are aware of the almabench results and and osnews article on
trigonometric performance.  However, the HotSpot implementation of
sin/cos on x86 for years has used and continues to use fsin/fcos x87
instructions in a range where those instructions meet the quality of
implementation requirements, basically [-pi/4, pi/4].  Outside of that
range, the results of fsin/fcos can be anywhere in the range [-1, 1]
with little relation to the true sine/cosine of the argument.  For
example, fsin(Math.PI) only gets about half the digits of the result
correct.  The reason for this is that the fsin/fcos instruction
implementations use a less than ideal algorithm for *argument
reduction*; the argument reduction process is explained in bug
4857011.

Although the range [-pi/4, pi/4] covers more than half of all
floating-point values, osnews et al. benchmark outside of this range.
For larger values, higher precision arithmetic is needed to compute
the (nearly) right answer.  Therefore, outside of [-pi/4, pi/4], the
speed of a compliant java.lang.Math sin/cos routine will largely be a
function of the argument's exponent.  Neither osnews nor almabench
examine the numerical results so the quality of the implementation is
not examined.

Intrinsifing more math library functions should be investigated in a
future release.

###@###.### 2004-03-11

Tan, LN and Log10 have all been intrinsified in Mustang.
###@###.### 2004-09-16
                                     
2004-03-11



Hardware and Software, Engineered to Work Together