Description
A DESCRIPTION OF THE REQUEST :
An API to access SIMD instructions.
We need an API that exposes SIMD instructions so that Java code can take advantage of hardware-accelerated vector math. On the latest CPUs, vector math using SIMD is up to a factor of 4 faster, which makes Java look inferior. This can be fixed relatively easily by giving Java code indirect access to these CPU instructions, for example:
java.lang.math.SIMD.add4(
float[] op1, int off1,
float[] op2, int off2,
float[] dst, int offDst
)
java.lang.math.SIMD.add4(
FloatBuffer op1, int off1,
FloatBuffer op2, int off2,
FloatBuffer dst, int offDst
)
The default (bytecode) implementation of the float[] variant would be:
dst[offDst+0] = op1[off1+0] + op2[off2+0];
dst[offDst+1] = op1[off1+1] + op2[off2+1];
dst[offDst+2] = op1[off1+2] + op2[off2+2];
dst[offDst+3] = op1[off1+3] + op2[off2+3];
These methods would be turned into intrinsics at runtime (like those of sun.misc.Unsafe), using the vector instructions of the current platform.
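The scalar fallback for both proposed overloads could look roughly like the sketch below. The class name SimdFallback is a hypothetical stand-in, since java.lang.math.SIMD is only a proposal in this request; the FloatBuffer variant assumes the offsets are absolute indices into the buffer.

```java
import java.nio.FloatBuffer;
import java.util.Arrays;

// Hypothetical stand-in for the proposed java.lang.math.SIMD class.
// This is only the plain-Java fallback; no intrinsic is involved.
final class SimdFallback {
    private SimdFallback() {}

    /** Adds four consecutive floats of op1 and op2 and stores the sums in dst. */
    static void add4(float[] op1, int off1,
                     float[] op2, int off2,
                     float[] dst, int offDst) {
        for (int k = 0; k < 4; k++) {
            dst[offDst + k] = op1[off1 + k] + op2[off2 + k];
        }
    }

    /** Same operation on FloatBuffers; absolute get/put leaves positions untouched. */
    static void add4(FloatBuffer op1, int off1,
                     FloatBuffer op2, int off2,
                     FloatBuffer dst, int offDst) {
        for (int k = 0; k < 4; k++) {
            dst.put(offDst + k, op1.get(off1 + k) + op2.get(off2 + k));
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {10f, 20f, 30f, 40f};
        float[] out = new float[4];
        add4(a, 0, b, 0, out, 0);
        System.out.println(Arrays.toString(out));
    }
}
```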
JUSTIFICATION :
With SIMD instructions one can (theoretically) perform 4 operations at a time. Although most modern CPUs take 2 or more cycles internally to execute a SIMD instruction, this still yields great performance improvements. In the latest (and upcoming) x86 CPUs, these operations are performed in 1 cycle internally.
The performance of the HotSpot JIT is ever increasing, but the gap between the VM and native executables that use SIMD is widening. In vector-based code, or other SIMD-friendly mathematical algorithms, performance can improve by a factor of 2 to 4, depending on the CPU's SIMD implementation.
__Making the JIT perform this optimisation behind the scenes is not sufficient__
Programmers can invent smart(er) ways of arranging data to make it SIMD-friendly or SIMD-optimal, whereas the JIT might overlook such cases or consider them too complex.
EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
float[] translate = new float[4];
float[] scale = new float[4];
float[] src = new float[vectors * 4];
float[] dst = new float[vectors * 4];
int end = vectors * 4;
for (int i = 0; i < end; i += 4)
{
    // dst = src * scale
    SIMD.mul4(src, i, scale, 0, dst, i);
    // dst = dst + translate (read from dst so the scaled values are kept)
    SIMD.add4(dst, i, translate, 0, dst, i);
}
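A self-contained version of this scale-then-translate loop can be run today with scalar helpers in place of the proposed intrinsics. The class ScaleTranslateDemo and its mul4/add4 helpers are hypothetical placeholders, not an existing API. Note that the add step reads from dst, so that the scaled values are carried into the translation.

```java
import java.util.Arrays;

// Hypothetical scalar stand-ins for the proposed SIMD.mul4 / SIMD.add4.
final class ScaleTranslateDemo {
    static void mul4(float[] a, int ai, float[] b, int bi, float[] d, int di) {
        for (int k = 0; k < 4; k++) {
            d[di + k] = a[ai + k] * b[bi + k];
        }
    }

    static void add4(float[] a, int ai, float[] b, int bi, float[] d, int di) {
        for (int k = 0; k < 4; k++) {
            d[di + k] = a[ai + k] + b[bi + k];
        }
    }

    /** Computes dst = src * scale + translate for each 4-float vector in src. */
    static float[] scaleThenTranslate(float[] src, float[] scale, float[] translate) {
        if (src.length % 4 != 0 || scale.length != 4 || translate.length != 4) {
            throw new IllegalArgumentException("expected 4-float vectors");
        }
        float[] dst = new float[src.length];
        for (int i = 0; i < src.length; i += 4) {
            mul4(src, i, scale, 0, dst, i);     // dst = src * scale
            add4(dst, i, translate, 0, dst, i); // dst = dst + translate
        }
        return dst;
    }

    public static void main(String[] args) {
        float[] src = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f};
        float[] scale = {2f, 2f, 2f, 1f};
        float[] translate = {10f, 20f, 30f, 0f};
        System.out.println(Arrays.toString(scaleThenTranslate(src, scale, translate)));
    }
}
```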
Posted Date : 2007-02-19 13:01:12.0
Evaluation
For the Java platform, it would be very uncharacteristic to provide an API of this sort. The details of SIMD instructions differ across architectures (and over time), and idioms that run faster on some platforms could run slower on others.
These sorts of transformations are better left to the JVM, which can more flexibly accommodate any issues of alignment and padding, data dependencies, etc.
Closing as will not fix.
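For comparison, the form the evaluation has in mind is an ordinary loop that the JVM is free to transform. The sketch below is illustrative only: the class name PlainLoopSketch is hypothetical, and whether a given JVM actually vectorises this loop depends on the JIT and the hardware, so no speedup is implied.

```java
import java.util.Arrays;

// Plain element-wise loop with no explicit SIMD calls. Any vectorisation is
// left entirely to the JVM and is not guaranteed.
final class PlainLoopSketch {
    /** Stores a[i] + b[i] into dst[i] for every index. */
    static void addAll(float[] a, float[] b, float[] dst) {
        if (a.length != b.length || a.length != dst.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        for (int i = 0; i < dst.length; i++) {
            dst[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] dst = new float[3];
        addAll(new float[] {1f, 2f, 3f}, new float[] {4f, 5f, 6f}, dst);
        System.out.println(Arrays.toString(dst));
    }
}
```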
Posted Date : 2007-03-01 05:00:33.0
Comments
Submitted On 25-FEB-2007
One problem to be resolved is ensuring that the 'pointer' is aligned to 128 bits. This might force a float[] to be allocated with this in mind, and the 'offset' argument would always have to be a multiple of 4.
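A rough sketch of both constraints follows. It assumes Java 9 or later for ByteBuffer.alignedSlice, and it uses a direct buffer because plain Java offers no way to control the alignment of a heap float[]. The names AlignmentSketch, requireVectorOffset and allocateAligned are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical helpers illustrating the 128-bit alignment constraints.
final class AlignmentSketch {
    static final int LANES = 4;                           // four floats per 128-bit vector
    static final int VECTOR_BYTES = LANES * Float.BYTES;  // 16 bytes

    /** Rejects offsets that do not start on a 4-float boundary. */
    static int requireVectorOffset(int offset) {
        if (offset < 0 || offset % LANES != 0) {
            throw new IllegalArgumentException("offset must be a non-negative multiple of 4: " + offset);
        }
        return offset;
    }

    /** Allocates a direct FloatBuffer whose first element is 16-byte aligned. */
    static FloatBuffer allocateAligned(int vectors) {
        int bytes = vectors * VECTOR_BYTES;
        // Over-allocate by one vector so an aligned window of 'bytes' always fits.
        ByteBuffer raw = ByteBuffer.allocateDirect(bytes + VECTOR_BYTES);
        ByteBuffer aligned = raw.alignedSlice(VECTOR_BYTES);
        aligned.limit(bytes);
        return aligned.order(ByteOrder.nativeOrder()).asFloatBuffer();
    }

    public static void main(String[] args) {
        FloatBuffer buf = allocateAligned(2);
        System.out.println("capacity in floats: " + buf.capacity());
        System.out.println("offset accepted: " + requireVectorOffset(4));
    }
}
```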
Submitted On 25-FEB-2007
It would still be beneficial to have implicit autovectorisation wherever it can be performed. However, explicit intrinsic SIMD instructions are really needed in this day and age.