EVALUATION
HashMap was rewritten in 1.4 to use a power-of-two sized closed hash table. The performance of this implementation is, generally speaking, better than that of the previous implementation, but the new implementation is much more sensitive to poor hash functions. To mitigate this problem, the new implementation applies a supplemental hash function to the value returned by hashCode().
We initially chose to use h - (h << 7) (i.e., -127 * h) as the supplemental hash function. This function is very cheap to calculate. Because 127 is prime, it doesn't "lose any bits", but it fails badly if all of the useful information in the original hash value is in the high order bits. We did lots of tests using common hash keys and came to the conclusion that this was not a problem in practice. We were wrong. Double's hash function is sufficiently bad that it confounds the supplemental hash function. While Doubles are not commonly used as Map keys or Set elements, a general purpose Map or Set implementation should be able to handle them.
The following delightful program demonstrates the problem:
import java.util.*;
public class Baz {
static final int SIZE = 500;
static final int TABLE_LENGTH = 1024;
public static void main(String args[]) {
Set s = new HashSet();
for (int i = 0; i < SIZE; i++) {
int h = new Double(i).hashCode();
int hPrime = h - (h << 7); // i.e., -127 * h
int bucket = hPrime & (TABLE_LENGTH-1);
s.add(new Integer(bucket));
}
System.out.println(s.size());
}
}
The program prints 1, indicating that the first 500 doubles hash to the same bucket. We cannot easily change Double's hash function, as it is specified in detail. (As a general rule, hash functions should *not* be specified. See p. 41 of Bloch's "Effective Java Programming Language Guide" for a discussion.)
HashMap's supplemental hash function is a simplified version of the shift-add hash function in the Linux buffer cache. It is described in CITI Technical Report 00-1, "Linux Kernel Hash Table Behavior: Analysis and Improvements", by Chuck Lever (http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf). The full version of this hash function is:
((h << 7) - h + (h >>> 9) + (h >>> 17));
It costs a bit more to calculate (two extra shifts and two extra adds), which is why we didn't use it in the first place. In retrospect, this was a mistake. Using a good supplemental hash function amounts to an insurance policy against poor hash functions. It costs a bit every time you use the Map, but it guards against disasters.
Substituting the full version into the program above causes it to print out 298 instead of 1. While this is much better, a top-quality hash function would print out ~392. (This is what you get if you set hPrime to a random number in the program above.) This suggests that we might find a better hash function even than the full version above. On the other hand, any hash function represents a tradeoff between calculation cost and quality, and this function may be a fine compromise.
###@###.### 2002-04-17
I spent the last few days studying integer hash functions. It turns out that one can do much better than the Linux buffer cache hash function above. (It is possible to get equivalent performance from much cheaper functions.)
An exhaustive automated search over a family of shift-add-XOR hash functions turned up this function:
public static int hash(int key) {
key += ~(key << 9);
key ^= (key >>> 14);
key += (key << 4);
key ^= (key >>> 10);
return key;
}
While not perfect, this function represents an excellent price/quality tradeoff. As one trivial indication of its quality, it causes the "delightful program" above to print out 397, which is virtually identical to what you'd get with a top-quality hash function.
###@###.### 2002-04-19
One final note: On my machine, the reporter's test program ran in 2738 ms on release 1.4. On release 1.3, it ran in 607 ms. On a pre-release build of 1.4.1 (with the new supplemental hash function) it runs in 330 ms.
###@###.### 2002-04-20
|