EVALUATION
http://hg.openjdk.java.net/jdk7/hotspot-comp/hotspot/rev/c771b7f43bbf
|
|
|
PUBLIC COMMENTS
And the numbers for AMD Shanghai:
$ gamma -XX:-UsePopCountInstruction test
sum: 629085184
time: 8504
$ gamma -XX:+UsePopCountInstruction test
sum: 629085184
time: 1807
4.7x speedup.
$ gamma -XX:-UsePopCountInstruction -XX:LoopUnrollLimit=1 test
sum: 629085184
time: 9622
$ gamma -XX:+UsePopCountInstruction -XX:LoopUnrollLimit=1 test
sum: 629085184
time: 2577
3.73x speedup.
|
|
|
PUBLIC COMMENTS
For the record, here is how much slower the kernel-level trap-based emulation on SPARC is (with 20 * 1000000 loop iterations):
$ gamma -XX:-UsePopCountInstruction test
VM option '-UsePopCountInstruction'
sum: 238869248
time: 1011
$ gamma -XX:+UsePopCountInstruction test
VM option '+UsePopCountInstruction'
sum: 238869248
time: 76985
That makes the trap-based emulation roughly 76x slower.
|
|
|
EVALUATION
The same numbers on a T2:
$ java -XX:-UsePopCountInstruction test
VM option '-UsePopCountInstruction'
sum: 629085184
time: 35676
$ java -XX:+UsePopCountInstruction test
VM option '+UsePopCountInstruction'
sum: 629085184
time: 20007
And without loop unrolling:
$ java -XX:-UsePopCountInstruction -XX:LoopUnrollLimit=1 test
VM option '-UsePopCountInstruction'
VM option 'LoopUnrollLimit=1'
sum: 629085184
time: 41509
$ java -XX:+UsePopCountInstruction -XX:LoopUnrollLimit=1 test
VM option '+UsePopCountInstruction'
VM option 'LoopUnrollLimit=1'
sum: 629085184
time: 29470
The speedup is 1.78x with loop unrolling and 1.41x without.
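The speedup figures are simply the elapsed time without the flag divided by the elapsed time with it. A minimal sketch of that arithmetic, using the T2 timings quoted above; the class and method names are made up for illustration and are not part of the benchmark:

```java
// Hypothetical helper (not part of the original benchmark): computes
// speedup ratios from the "time:" values printed by the runs above.
final class SpeedupCalc {
    // Speedup = time with -XX:-UsePopCountInstruction
    //         / time with -XX:+UsePopCountInstruction.
    static double speedup(long baselineMillis, long popcntMillis) {
        return (double) baselineMillis / popcntMillis;
    }

    public static void main(String[] args) {
        // T2 timings with default loop unrolling.
        System.out.printf("unrolled:     %.2fx%n", speedup(35676, 20007));
        // T2 timings with -XX:LoopUnrollLimit=1.
        System.out.printf("not unrolled: %.2fx%n", speedup(41509, 29470));
    }
}
```

The same helper can be applied to the timings from the other machines in this report.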
|
|
|
EVALUATION
A very simple micro-benchmark like this:
public class test {
    public static void main(String[] args) {
        int sum = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 2000 * 1000000; i++) {
            sum += Integer.bitCount(i);
        }
        long end = System.currentTimeMillis();
        System.out.println("sum: " + sum);
        System.out.println("time: " + (end - start));
    }
}
shows a 5x speedup on a Nehalem processor:
$ gamma -XX:-UsePopCountInstruction test
VM option '-UsePopCountInstruction'
sum: 629085184
time: 8132
$ gamma -XX:+UsePopCountInstruction test
VM option '+UsePopCountInstruction'
sum: 629085184
time: 1604
And with loop unrolling disabled to get more accurate numbers:
$ gamma -XX:-UsePopCountInstruction -XX:LoopUnrollLimit=1 test
VM option '-UsePopCountInstruction'
VM option 'LoopUnrollLimit=1'
sum: 629085184
time: 8657
$ gamma -XX:+UsePopCountInstruction -XX:LoopUnrollLimit=1 test
VM option '+UsePopCountInstruction'
VM option 'LoopUnrollLimit=1'
sum: 629085184
time: 1458
It's interesting to see that with popcnt the tighter, non-unrolled loop is actually faster (1458 vs. 1604), which raises the speedup to about 5.9x.
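For context on what the flag changes: without the intrinsic, Integer.bitCount runs as ordinary compiled Java code that needs a series of shift, mask and add operations per call, whereas with it the compiler can emit a single popcnt instruction. The sketch below shows a classic bit-parallel population count of that kind. The class name SoftPopCount is made up for illustration, and the code is not claimed to be the exact library source.

```java
// Hypothetical illustration of a software population count; the class
// name is an assumption and this is not taken from the JDK sources.
final class SoftPopCount {
    // Counts set bits by summing adjacent bit fields of growing width.
    static int bitCount(int i) {
        i = i - ((i >>> 1) & 0x55555555);                // 2-bit sums
        i = (i & 0x33333333) + ((i >>> 2) & 0x33333333); // 4-bit sums
        i = (i + (i >>> 4)) & 0x0f0f0f0f;                // 8-bit sums
        i = i + (i >>> 8);                               // fold bytes
        i = i + (i >>> 16);                              // fold halves
        return i & 0x3f;                                 // result is 0..32
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 0xFF, -1, Integer.MIN_VALUE, 0x12345678};
        for (int v : samples) {
            // Cross-check against the library method used by the benchmark.
            if (bitCount(v) != Integer.bitCount(v)) {
                throw new AssertionError("mismatch for " + v);
            }
        }
        System.out.println("all samples match Integer.bitCount");
    }
}
```

Replacing this whole sequence with one instruction is consistent with the large gains measured on the x86 machines above.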
|
|
|