哈希查找（Hashing）

Back to Simple Search: Hashing

Linear search is OK for small data sets, bad for large.

So linear search would be OK if we could rapidly narrow the search to a few items.

Suppose that in constant time could put any item in our data set into a numbered bucket, where # buckets stays within a constant factor of # keys.

Suppose also that buckets contain roughly equal numbers of keys.

Then search would be constant time.

线性搜索对于小的集合是可行的，但是对于大集合却是糟糕的。

对于线性搜索，如果我们能迅速的缩小搜索范围变成只有几个元素，那将是非常有益的。

假设如果对于常定时间可以将一定的元素放进我们的被分成一定数量桶的数据集合，每个桶保持常数个数内的元素。

再假设每个桶包含大致相同的元素。

于是搜索将趋近常数时间完成。

Hash functions

To do this, must have way to convert key to bucket number: a hash function.
Example:
– N = 200 data items.
– keys are longs, evenly spread over the range 0..2⁶³ − 1.
– Want to keep maximum search to L = 2 items.
– Use hash function h(K) = K%M, where M = N/L = 100 is the number of buckets: 0 ≤ h(K) < M.
– So 100232, 433, and 10002332482 go into different buckets, but 10, 400210, and 210 all go into the same bucket.

为了达到这一目的，必须有一种方法将元素转为桶的编号：一个哈希函数。

例子：

— N = 100 个属于元素。

— 元素键值可能很长，大小从0到2⁶³-1。

— 保持最大搜索长度为 L = 2 个元素。

— 使用哈希函数 h(K) = K%M，其中 M = N/L = 100 是桶的个数：0 ≤ h(K) < M。

— 所以 100232，443，10002332482在不同的桶中，而10,400210，210则全在相同的桶中。

Filling the Table

To get (likely to be) constant-time lookup, need to keep #buckets within constant factor of #items.

So resize table when load factor gets higher than some limit.

In general, must re-hash all table items.

Still, this operation constant time per item,

So by doubling table size each time, get constant amortized time for insertion and lookup

(Assuming, that is, that our hash function is good).

为了让搜索时间为常数，需要保持常数个数的元素。

所以当元素个数达到一定限制，需要将表进行调整。

一般来说必须重新排列所有哈希列表元素。

不过，这个操作的成本将会平均分配到每个元素中。

所以通过每次将表的大小扩大一倍，对于插入和查找得到一个平均分摊事件成本。

（前提是我们的哈希函数是良好的）

Hash Functions: Strings

For String, "s₀s₁ · · · s_n−1" want function that takes all characters and their positions into account.

What’s wrong with s₀ + s₁ + . . . + s_n−1?

For strings, Java uses

h(s) = s₀ · ³¹ⁿ⁻¹ + s₁ · ³¹ⁿ⁻² + . . . + s_n−1

To convert to a table index in 0..N − 1, compute h(s)%N (but don’t use table size that is multiple of 31!)

Not as hard to compute as you might think; don’t even need multiplication!

　　int r; r = 0;
　　for (int i = 0; i < s.length (); i += 1)
　　　　r = (r << 5) - r + s.charAt (i);

对于字符串 "s₀s₁ · · · s_n−1" 需要一个函数去让所有字节和他们的位置变成一个数字。

如何处理s₀ + s₁ + . . . + s_n−1呢？

对于Java，是这么处理的

h(s) = s₀ · ³¹ⁿ⁻¹ + s₁ · ³¹ⁿ⁻² + . . . + s_n−1
如果要计算表的索引，则需要计算h(s)%N

这并不是很难计算，你可能能够发现，这甚至不需要乘法：
    int r; r = 0;
    for (int i = 0; i < s.length (); i += 1)
        r = (r << 5) - r + s.charAt (i);

Hash Functions: Other Data Structures

Lists (ArrayList, LinkedList, etc.) are analagous to strings: e.g., Java uses

　　hashCode = 1; Iterator i = list.iterator();
　　while (i.hasNext()) {
　　　　Object obj = i.next();
　　　　hashCode =
　　　　　　31*hashCode
　　　　　　+ (obj==null ? 0 : obj.hashCode());
　　}

Can limit time spent computing hash function by not looking at entire list. For example: look only at first few items (if dealing with a List or SortedSet).

Causes more collisions, but does not cause equal things to go to different buckets.

Lists类（ArrayList，LinkedList等）与字符串非常类似，例如在Java中：
　　hashCode = 1; Iterator i = list.iterator();
　　while (i.hasNext()) {
　　　　Object obj = i.next();
　　　　hashCode =
　　　　　　31*hashCode
　　　　　　+ (obj==null ? 0 : obj.hashCode());
　　}
可以通过不遍历整个列表限制计算哈希函数的时间。例如：只查找前面的几个元素（如果实在处理一个List或者一个SortedSet）。

产生更多冲突的可能性，但是相同的东西不会进入不同的桶里面。

Recursively defined data structures ⇒ recursively defined hash functions.

For example, on a binary tree, one can use something like hash(T):

　　if (T == null)
　　　　return 0;
　　else return someHashFunction (T.label ())
　　　　+ 255 * hash(T.left ())
　　　　+ 255*255 * hash(T.right ());

Can use address of object (“hash on identity”) if distinct (!=) objects are never considered equal.

But careful! Won’t work for Strings, because .equal Strings could be in different buckets:

　　String H = "Hello",
　　S1 = H + ", world!",
　　S2 = "Hello, world!";

Here S1.equals(S2), but S1 != S2.

可以通过递归定义数据结构 ⇒ 递归定义哈希函数。

例如，在二叉树中，可以这样定义hash(T)：
　　if (T == null)
　　　　return 0;
　　else return someHashFunction (T.label ())
　　　　+ 255 * hash(T.left ())
　　　　+ 255*255 * hash(T.right ());
可以使用对象的地址（也就是他的哈希身份证），如果地址不一样（!=），那么就把对象看成是不相同的。

但String类型却有些不同，它们可能相同，但是却不在同一个桶里：
　　String H = "Hello",
　　S1 = H + ", world!",
　　S2 = "Hello, world!";
在这里，S1.equals(S2)，但S1 != S2。

Hashing in HashMap

实际上HashMap是类似这样子的结构。通过哈希值将KEY分到不同的桶里面，而每个桶又是一个链表。

当通过一个KEY想得到里面的Value时，程序逻辑是先将KEY变成Hash，然后找到相应的桶，看看桶里面的元素是否是该KEY值，如果不是则到链表下一个节点，直到找到KEY为止。

Comparing Search Structures

相关资料

Data Structures (Into Java) . Paul N. Hilfinger