Bug Database
Bug ID: 6526380
Votes 5
Synopsis Add API to access SIMD instructions
Category java:classes_lang
Reported Against
Release Fixed
State 11-Closed, Will Not Fix, request for enhancement
Priority: 4-Low
Related Bugs 6604786
Submit Date 19-FEB-2007
Description
A DESCRIPTION OF THE REQUEST :
An API to access SIMD instructions.

We need an API to access SIMD instructions so that Java code can take advantage of hardware-accelerated vector math. On the latest CPUs vector math can be up to four times faster, which makes Java look inferior by comparison. This could be fixed relatively easily by giving programs indirect access to these CPU instructions.

java.lang.math.SIMD.add4(
                       float[] op1, int off1,
                       float[] op2, int off2,
                       float[] dst, int offDst
)

java.lang.math.SIMD.add4(
                       FloatBuffer op1, int off1,
                       FloatBuffer op2, int off2,
                       FloatBuffer dst, int offDst
)


The default (bytecode) implementation of this method would be:
dst[offDst+0] = op1[off1+0] + op2[off2+0];
dst[offDst+1] = op1[off1+1] + op2[off2+1];
dst[offDst+2] = op1[off1+2] + op2[off2+2];
dst[offDst+3] = op1[off1+3] + op2[off2+3];

These methods would be turned into intrinsics at runtime (like the methods of sun.misc.Unsafe), using the vector instructions of the current platform.
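The pure-Java fallback described above can be sketched as follows. This is a minimal illustration only: the class name SimdFallback is hypothetical (the proposed java.lang.math.SIMD class does not exist in the JDK), and a JVM that implemented the request would presumably replace these method bodies with vector instructions.

```java
import java.nio.FloatBuffer;

// Hypothetical stand-in for the proposed SIMD class; not a JDK API.
final class SimdFallback {
    private SimdFallback() {}

    // Adds four consecutive floats of op1 and op2 and stores the sums in dst.
    static void add4(float[] op1, int off1,
                     float[] op2, int off2,
                     float[] dst, int offDst) {
        for (int k = 0; k < 4; k++) {
            dst[offDst + k] = op1[off1 + k] + op2[off2 + k];
        }
    }

    // Same operation on buffers. Absolute get/put is used so that the
    // position of each buffer is left untouched.
    static void add4(FloatBuffer op1, int off1,
                     FloatBuffer op2, int off2,
                     FloatBuffer dst, int offDst) {
        for (int k = 0; k < 4; k++) {
            dst.put(offDst + k, op1.get(off1 + k) + op2.get(off2 + k));
        }
    }
}
```

Both overloads throw the usual index exceptions when fewer than four elements are available at an offset, which is the behaviour a bytecode implementation would have without any extra checks.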

JUSTIFICATION :
With SIMD instructions one can (theoretically) perform 4 operations at a time. Even though most modern CPUs take 2 or more cycles internally to execute a SIMD instruction, this still yields great performance improvements. In the latest (and upcoming) x86 CPUs, these operations are performed in 1 cycle internally.

The performance of the HotSpot JIT is ever increasing, but the gap between the VM and native executables that use SIMD is widening. In vector-based code, or other SIMD-friendly mathematical algorithms, performance can improve by a factor of 2 to 4, depending on the CPU's SIMD implementation.

__Making the JIT perform this optimisation behind the scenes is not sufficient__

  Programmers can invent smart(er) ways of arranging data to make it SIMD-friendly or SIMD-optimal, while the JIT might overlook such cases or consider them too complex.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
float[] translate = new float[4];
float[] scale = new float[4];
float[] src = new float[vectors * 4];
float[] dst = new float[vectors * 4];

int end = vectors * 4;
for(int i=0; i<end; i+=4)
{
   SIMD.mul4(src, i, scale, 0, dst, i);
   SIMD.add4(dst, i, translate, 0, dst, i);
}
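The loop above can be tried against a plain-Java stand-in. The following is a minimal sketch under two assumptions: Simd4 is a hypothetical class with scalar mul4 and add4 methods (not a JDK API), and the intended result is src * scale + translate, so the add step reads the product already stored in dst.

```java
// Hypothetical scalar stand-in for the proposed API; names are illustrative.
final class Simd4 {
    private Simd4() {}

    // dst[offDst..offDst+3] = op1[off1..off1+3] * op2[off2..off2+3]
    static void mul4(float[] op1, int off1, float[] op2, int off2,
                     float[] dst, int offDst) {
        for (int k = 0; k < 4; k++) {
            dst[offDst + k] = op1[off1 + k] * op2[off2 + k];
        }
    }

    // dst[offDst..offDst+3] = op1[off1..off1+3] + op2[off2..off2+3]
    static void add4(float[] op1, int off1, float[] op2, int off2,
                     float[] dst, int offDst) {
        for (int k = 0; k < 4; k++) {
            dst[offDst + k] = op1[off1 + k] + op2[off2 + k];
        }
    }

    // Scales and then translates every 4-float vector in src.
    // Trailing elements that do not fill a whole vector are left at 0.
    static float[] transform(float[] src, float[] scale, float[] translate) {
        float[] dst = new float[src.length];
        for (int i = 0; i + 4 <= src.length; i += 4) {
            mul4(src, i, scale, 0, dst, i);
            add4(dst, i, translate, 0, dst, i); // add to the scaled values
        }
        return dst;
    }
}
```

Passing dst as both an input and the output of add4 is safe here because each element is read before the same element is written.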
Posted Date : 2007-02-19 13:01:12.0
Work Around
N/A
Evaluation
For the Java platform, it would be very uncharacteristic to provide an API of this sort.  The details of SIMD instructions differ across architectures (and over time), and idioms that run faster on some platforms could run slower on others.

These sorts of transformations are better left to the JVM, which can more flexibly accommodate any issues of alignment and padding, data dependencies, etc.

Closing as will not fix.
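The approach favoured by this evaluation amounts to writing an ordinary counted loop and letting the JIT decide whether to vectorise it. A minimal sketch, assuming a hypothetical ScaleTranslate helper and a single scale and translate value for all elements (a simplification of the per-lane float[4] operands in the request); whether a given JVM actually emits SIMD instructions for it depends on the JVM version and the hardware.

```java
// Hypothetical helper; the class and method names are illustrative only.
final class ScaleTranslate {
    private ScaleTranslate() {}

    // Computes dst[i] = src[i] * scale + translate for every element.
    // A simple loop with no cross-iteration dependencies is the kind of
    // code a JIT compiler may choose to vectorise on its own.
    static void apply(float[] src, float scale, float translate, float[] dst) {
        int n = Math.min(src.length, dst.length);
        for (int i = 0; i < n; i++) {
            dst[i] = src[i] * scale + translate;
        }
    }
}
```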
Posted Date : 2007-03-01 05:00:33.0
Comments
  

Submitted On 25-FEB-2007
One problem to be resolved is ensuring that the 'pointer' is aligned to 128 bits. This might force a float[] to be allocated with this in mind, and the 'offset' argument would always have to be a multiple of 4.
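The offset half of this constraint can be sketched as a simple argument check. This is a minimal illustration with a hypothetical Offsets helper; it only verifies that the offset is a multiple of 4 floats (4 x 32 bits = 128 bits) relative to the start of the array. It says nothing about where the array itself sits in memory, which ordinary Java code cannot observe.

```java
// Hypothetical helper; not part of any existing API.
final class Offsets {
    static final int LANES = 4; // 4 floats x 32 bits = 128 bits

    private Offsets() {}

    // Returns the offset unchanged if it is a non-negative multiple of 4,
    // otherwise throws IllegalArgumentException.
    static int requireAligned(int offset) {
        if (offset < 0 || (offset & (LANES - 1)) != 0) {
            throw new IllegalArgumentException(
                "offset must be a non-negative multiple of " + LANES + ": " + offset);
        }
        return offset;
    }
}
```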


Submitted On 25-FEB-2007
It would still be beneficial to have implicit autovectorisation wherever it can be performed. However, explicit intrinsic SIMD instructions are really needed in this day and age.


