|
Description
|
A DESCRIPTION OF THE REQUEST :
Current JVMs implement Math.abs() using compares and branches even though there are far more efficient code sequences for computing the absolute value of a floating point value that do not require a branches. These non-branching code sequences are both much shorter and cheaper than the compare and branch code currently emitted by hotspot (in both the client and server JVMs).
For example, x87 has a fabs instruction which is slightly cheaper than an add and SSE uses a boolean & instructions (andsd for doubles andss for floats) which has the same cost as an add instruction. However as the timing results below show, currently Math.abs() is much more expensive than an add.
JUSTIFICATION :
Math.abs() is currently much slower and more expensive than it should be. It costs around six times more than an add when it should cost the same or even be slightly cheaper. The cost of Math.abs() can be significant in some codes (including some of ours) that make heavy use of it. Using the more efficient code sequences would signifcantly speed up such programs.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Expected behaviour is that Math.abs(double) and Math.abs(float) should cost about the same as an add.
ACTUAL -
Actual behaviour is that Math.abs() is much more expensive than an add as show in the timing results for the server and client 1.5.0rc JVMs below.
java version "1.5.0-rc"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-rc-b63)
Java HotSpot(TM) Server VM (build 1.5.0-rc-b63, mixed mode)
avg 2.3 ns total 4.68E-1 s for assign (~ 4.0 cycles)
avg 2.5 ns total 5.00E-1 s for add (~ 4.2 cycles)
avg 17.0 ns total 3.41E0 s for Math.abs() (~ 29.0 cycles)
avg 2.4 ns total 4.85E-1 s for assign (~ 4.1 cycles)
avg 2.5 ns total 5.00E-1 s for add (~ 4.2 cycles)
avg 17.0 ns total 3.39E0 s for Math.abs() (~ 28.8 cycles)
java version "1.5.0-rc"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-rc-b63)
Java HotSpot(TM) Client VM (build 1.5.0-rc-b63, mixed mode, sharing)
avg 5.0 ns total 1.00E0 s for assign (~ 8.5 cycles)
avg 5.3 ns total 1.06E0 s for add (~ 9.0 cycles)
avg 30.8 ns total 6.16E0 s for Math.abs() (~ 52.3 cycles)
avg 5.0 ns total 1.00E0 s for assign (~ 8.5 cycles)
avg 5.4 ns total 1.08E0 s for add (~ 9.2 cycles)
avg 30.6 ns total 6.11E0 s for Math.abs() (~ 51.9 cycles)
---------- BEGIN SOURCE ----------
import java.text.DecimalFormat;
import java.util.Random;
/** Test that show that Math.abs is much slower than a floating point add
* even though it should have about the same cost
*
* @author bjw Bruce Walter, Cornell Program of Computer Graphics 2004
*/
public class AbsTest {
//target total number of repetitions of the operation
public static final int opTargetRepetitions = 200000000;
//size of arrays that are operated on
public static final int arraySize = 10000;
//number of times we need to process each array to reach total target count
public static final int reps = opTargetRepetitions/arraySize;
//pretty print the output numbers to make them easier to read
public static final DecimalFormat decForm = new DecimalFormat("###0.0");
public static final DecimalFormat sciForm = new DecimalFormat("0.00E0");
//my processor is a 1.7GHz Xeon (actually it is a dual processor, but this test is single threaded)
public static final double ghzProcSpeed = 1.7; //my processor is 1.7GHz
public static void runTimingTest(TestOp op, double result[], double src[], boolean print) {
long time = System.currentTimeMillis();
for(int i=0;i<reps;i++) {
op.performOp(result,src);
}
time = System.currentTimeMillis() - time;
double denom = 1000000.0/(reps*src.length);
if (print) {
String ps = decForm.format(time*denom);
while (ps.length()<6) ps = " "+ps;
ps = "avg "+ps+" ns total "+sciForm.format(time/1000.0)+" s";
while (ps.length()<32) ps += " ";
ps = ps+" for "+op.toString();
while (ps.length()<50) ps += " ";
System.out.println(ps+"\t(~ "+decForm.format(time*denom*ghzProcSpeed)+" cycles)");
}
}
public static void main(String[] args) throws InterruptedException {
double src[] = new double[arraySize];
double result[] = new double[arraySize];
Random ran = new Random(5232482349538L);
//set the src array to be random values between -1 and 1 (but excluding zero)
for(int i=0;i<src.length;i++) {
do {
src[i] = 2*ran.nextDouble() - 1.0;
} while (src[i] == 0);
}
TestOp tests[] = { new AssignOp(), new AddOp(), new AbsOp()};
//warm up hotspot
for(int i=0;i<tests.length;i++) {
runTimingTest(tests[i],result,src,false);
}
//now run the real tests and print the timings
for(int i=0;i<tests.length;i++) {
runTimingTest(tests[i],result,src,true);
}
//do it again to show the timings are reasonably consistent
for(int i=0;i<tests.length;i++) {
runTimingTest(tests[i],result,src,true);
}
}
public abstract static class TestOp {
public abstract void performOp(double result[], double src[]);
}
public static class AssignOp extends TestOp {
public String toString() { return "assign"; }
public void performOp(double result[], double src[]) {
for(int i=0;i<src.length;i++) {
result[i] = src[i];
}
}
}
public static class AddOp extends TestOp {
public String toString() { return "add"; }
public void performOp(double result[], double src[]) {
for(int i=0;i<src.length;i++) {
result[i] = 0.143+src[i];
}
}
}
public static class AbsOp extends TestOp {
public String toString() { return "Math.abs()"; }
public void performOp(double result[], double src[]) {
for(int i=0;i<src.length;i++) {
result[i] = Math.abs(src[i]);
}
}
}
}
---------- END SOURCE ----------
CUSTOMER SUBMITTED WORKAROUND :
None. Attempts to force the more efficient code sequence such as:
private static double myAbs(double a) {
return Double.longBitsToDouble(Double.doubleToRawLongBits(a)&0x7fffffffffffffffL);
}
are not any faster due to the overheads of calling these methods on Double.
(Incident Review ID: 315868)
======================================================================
xxxxx@xxxxx 10/14/04 23:50 GMT
|
|
Evaluation
|
Fix this in Mustang
xxxxx@xxxxx 2004-09-29
I'm seeing performance of ABS that is roughly 2.5x slower, not 6x as the submitter claims:
@prt-solx86-q1-2$ $jp/bin/java -server AbsTest
avg 2.7 ns total 5.31E-1 s for assign (~ 4.0 cycles)
avg 2.9 ns total 5.74E-1 s for add (~ 4.3 cycles)
avg 7.6 ns total 1.53E0 s for Math.abs() (~ 11.4 cycles)
avg 2.7 ns total 5.31E-1 s for assign (~ 4.0 cycles)
avg 2.8 ns total 5.50E-1 s for add (~ 4.1 cycles)
avg 7.6 ns total 1.53E0 s for Math.abs() (~ 11.5 cycles)
My machine is a dual 1.5Ghz, but even on a P3-700Mhz performance is still roughly 2.5x slower
@foundation$ $jp/bin/java -server AbsTest
avg 5.8 ns total 1.17E0 s for assign (~ 4.1 cycles)
avg 6.4 ns total 1.29E0 s for add (~ 4.5 cycles)
avg 16.4 ns total 3.28E0 s for Math.abs() (~ 11.5 cycles)
avg 5.8 ns total 1.17E0 s for assign (~ 4.1 cycles)
avg 6.4 ns total 1.29E0 s for add (~ 4.5 cycles)
avg 16.4 ns total 3.27E0 s for Math.abs() (~ 11.5 cycles)
Looking into this...
xxxxx@xxxxx 10/14/04 23:50 GMT
It seems the SSE2 version of AbsD is SLOW. If I enable the flags -XX:UseSSE=0 or -XX:UseSSE=1 then ABS is fast again (because we are using FABS). Looking into a faster instruction for SSE2 machines
xxxxx@xxxxx 10/20/04 16:38 GMT
Got ABS intrinsified... 4cycles on a PC, 7 cycles on a SPARC (same as an ADD) and working on C1 now.
xxxxx@xxxxx 10/25/04 19:41 GMT
|
|
Comments
|
Submitted On 20-DEC-2004
mksreddy
My code which is using Math.abs (on double) is about 3 times faster in latest mustang build:
java version "1.6.0-ea"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.6.0-ea-b16
Java HotSpot(TM) Client VM (build 1.6.0-ea-b16, mixed mode, sharing)
Before mustang the methodn execution used to take 220ms on average, Now it takes about 70ms on average.
Thanks for the change.
PLEASE NOTE: JDK6 is formerly known as Project Mustang
|