Hashcode Of A String In Java

Many Java programmers know what a hash code is, but not exactly how it is calculated or why 31 is used in the calculation. Below is the code from Java 1.6 that calculates the hash code for a string:

public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}

After reading up a bit, I wrote a small test program that computes the hash code of a string in two ways: by multiplying by 31, and by the equivalent shift-and-subtract form (i << 5) - i, which shifts i left by 5 bits and then subtracts i. Below is the test program:

public class TestHash {
    public static void main(String[] args) {
        String str1 = "What the heck?";

        int hashcode1 = 0;
        int hashcode2 = 0;

        for (int i = 0; i < str1.length(); i++) {
            hashcode1 = 31*hashcode1 + str1.charAt(i);
            hashcode2 = (hashcode2 << 5) - hashcode2 + str1.charAt(i);
        }

        System.out.println("Hashcode1 : " + hashcode1);
        System.out.println("Hashcode2 : " + hashcode2);
    }
}

The output for this program is:

Hashcode1 : 277800975
Hashcode2 : 277800975
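
As a quick cross-check, the same loop can be compared against the built-in String.hashCode(), which is documented as s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] in int arithmetic. The following is only a minimal sketch; the class name PolyHashCheck and the helper polyHash are made up for illustration and are not part of the JDK.

```java
class PolyHashCheck {
    // Hypothetical helper: evaluates s[0]*31^(n-1) + ... + s[n-1]
    // using Horner's rule, with normal int overflow (wrap-around).
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String[] samples = {"", "a", "Aa", "What the heck?"};
        for (String s : samples) {
            // Both columns should agree for every sample.
            System.out.println("\"" + s + "\": manual=" + polyHash(s)
                    + " builtin=" + s.hashCode());
        }
    }
}
```

The empty string hashes to 0 and a one-character string hashes to its character code, which makes the formula easy to check by hand.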

1. What does this code actually mean?

Even if you know why 31 is used, there is a lot more to learn about hashing, hash collisions, and the many algorithms for calculating hash values. First off, no general-purpose hashing algorithm is free of collisions: there are far more possible strings than 32-bit int values, so some strings must share a hash code. There are, however, several algorithms that keep collisions rare enough to be useful. As for why 31 is used in calculating the hash code, this is the reason given by Joshua Bloch in the book 'Effective Java':

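According to 'Effective Java', 31 was chosen because it is an odd prime. If the multiplier were even and the multiplication overflowed, information would be lost, because multiplying by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional to use one when computing hash values. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31*i == (i<<5)-i. Modern VMs perform this optimization automatically.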
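
Both claims can be tried out directly. The sketch below is illustrative only: the class MultiplierDemo and the helper hashWith are hypothetical names, and 32 is picked as the even multiplier simply because it makes the information loss easy to see. With multiplier 32, the first character of an 8-character string is multiplied by 32^7 = 2^35, which is 0 modulo 2^32, so it drops out of the result entirely.

```java
class MultiplierDemo {
    // Same loop as String.hashCode(), but with a configurable multiplier
    // (hypothetical helper for illustration).
    static int hashWith(int multiplier, String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = multiplier * h + s.charAt(i);
        }
        return h;
    }

    // Checks 31*i == (i << 5) - i on a few values, including ones
    // where the multiplication overflows.
    static boolean shiftIdentityHolds() {
        int[] values = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : values) {
            if (31 * i != (i << 5) - i) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("identity holds: " + shiftIdentityHolds());

        // Two 8-character strings that differ only in the first character.
        String a = "Axxxxxxx";
        String b = "Bxxxxxxx";
        System.out.println("multiplier 32 collides: "
                + (hashWith(32, a) == hashWith(32, b)));
        System.out.println("multiplier 31 collides: "
                + (hashWith(31, a) == hashWith(31, b)));
    }
}
```

With an odd multiplier such as 31, no power of the multiplier is ever 0 modulo 2^32, so every character keeps some influence on the final value.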

2. What properties does the returned hashCode have?

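As you can see, the String class computes its hashCode from its value, so equal values always produce the same hashCode. This is easy to understand: if two strings have the same value, equals returns true for them, and objects that are equal according to equals must have equal hashCodes. The converse does not necessarily hold: equal hashCodes do not guarantee equal objects.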

A good hash function should, as far as possible, produce unequal hashCodes for unequal objects.

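Ideally, a hash function should distribute the unequal instances in a collection uniformly across all possible hashCodes. Achieving this ideal is very difficult, and Java at least does not achieve it. As we can see, the hashCode is not random; it follows a fixed rule, namely the formula above, so we can construct strings that have different values but the same hashCode. For example, "Aa" and "BB" have the same hashCode: 65*31 + 97 = 2112 and 66*31 + 66 = 2112.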
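
One colliding pair is enough to build many more. Because hash(prefix + block) = hash(prefix)*31^2 + hash(block) for a two-character block, appending "Aa" or "BB" to the same prefix always gives the same hashCode. The sketch below (the class CollisionDemo and the method collidingStrings are hypothetical names for illustration) builds 2^n distinct strings that all share one hashCode.

```java
import java.util.ArrayList;
import java.util.List;

class CollisionDemo {
    // Builds all 2^n strings made of n blocks, each block being
    // either "Aa" or "BB". All of them have the same hashCode.
    static List<String> collidingStrings(int n) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < n; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints 8 different strings with identical hash codes.
        for (String s : collidingStrings(3)) {
            System.out.println(s + " -> " + s.hashCode());
        }
    }
}
```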

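At this point you might think: if constructing hash collisions is that simple, could I feed a HashMap many <key,value> pairs whose keys are all different but share the same hashCode, so that the HashMap degenerates into a singly linked list and the running time for inserting the entries goes from linear to quadratic? HashMap does apply an extra hashing step of its own on top of the key's hashCode, but keys with identical hashCodes still end up in the same bucket, so in theory this can be done. This is the Hash Collision DoS issue that has drawn a lot of attention recently.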
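
To make the scenario concrete, here is a small sketch that fills a HashMap with 1024 distinct keys sharing a single hashCode. The class CollidingKeysDemo and its helpers keyFor and buildMap are hypothetical names chosen for this illustration; the keys are built from the "Aa"/"BB" pair mentioned above. It shows only that the map stays correct, not how slow it becomes, since timing depends on the JDK version and machine.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class CollidingKeysDemo {
    // Builds a key of `blocks` two-character blocks; bit b of `bits`
    // selects "BB" (1) or "Aa" (0) for block b.
    static String keyFor(int bits, int blocks) {
        StringBuilder sb = new StringBuilder();
        for (int b = 0; b < blocks; b++) {
            sb.append(((bits >> b) & 1) == 0 ? "Aa" : "BB");
        }
        return sb.toString();
    }

    // Inserts 2^blocks distinct keys that all have the same hashCode.
    static Map<String, Integer> buildMap(int blocks) {
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < (1 << blocks); i++) {
            map.put(keyFor(i, blocks), i);
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, Integer> map = buildMap(10);
        Set<Integer> hashes = new HashSet<>();
        for (String key : map.keySet()) {
            hashes.add(key.hashCode());
        }
        System.out.println("entries: " + map.size());
        System.out.println("distinct hash codes: " + hashes.size());
    }
}
```

Lookups still return the right value because HashMap falls back to equals within a bucket; the cost is that each operation may have to walk past many colliding entries. Newer JDKs (Java 8 and later) convert heavily collided buckets into balanced trees, which limits this worst case.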

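The hashCode method overridden in HashMap's entry class (it combines the hash codes of the key and the value; choosing a bucket uses only the key's hashCode):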

public final int hashCode() {
    return (key == null ? 0 : key.hashCode()) ^
           (value == null ? 0 : value.hashCode());
}
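
The XOR combination above is the one specified for Map.Entry.hashCode(), so it can be observed with any standard entry implementation. The following sketch uses java.util.AbstractMap.SimpleEntry; the class EntryHashDemo and the helper entryHash are hypothetical names for illustration. Note that an entry whose key and value have the same hashCode, such as "Aa" and "BB", gets an entry hash of 0.

```java
import java.util.AbstractMap;
import java.util.Map;

class EntryHashDemo {
    // Hypothetical helper mirroring the method shown above:
    // key hash XOR value hash, with null treated as 0.
    static int entryHash(Object key, Object value) {
        return (key == null ? 0 : key.hashCode())
                ^ (value == null ? 0 : value.hashCode());
    }

    public static void main(String[] args) {
        Map.Entry<String, String> e1 =
                new AbstractMap.SimpleEntry<>("key", "value");
        System.out.println(e1.hashCode() == entryHash("key", "value"));

        // "Aa" and "BB" both hash to 2112, and 2112 ^ 2112 == 0.
        Map.Entry<String, String> e2 =
                new AbstractMap.SimpleEntry<>("Aa", "BB");
        System.out.println(e2.hashCode());
    }
}
```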

References:

http://crd1991.iteye.com/blog/1473108

http://java-bytes.blogspot.com/2009/10/hashcode-of-string-in-java.html