哈希表设计

LeetCode 上两道相似的 hash 数据结构设计题。

题目描述

两道题题干相似，都是给出接口要求实现数据结构。两道题都做出了一些简化以降低设计难度：

key / value 都用的是 int 类型；
ket / value 的取值范围限定在 [0, 10^6]；
O(1) API 的访问次数上限设定在 10^4 次；

解题思路

其实这道题的时间和空间卡的非常松，grandyang的方案直接用一个能覆盖key取值范围的大数组也能过。当然，对这种方案我们也可以稍微优化一下，比如 hashset 的时候用 bitmap 代替 int，可以压缩到原本所需空间的 1/32。

不过既然是考察哈希设计，我们还是要尽量考虑设计一个更有“哈希味儿”的方案，即使是做的简单粗糙一些：

直接写死哈希表的长度，映射到同一个哈希值的元素使用数组来存放；
相同哈希值的元素数组，使用简单的线性查找来维护；
hash function，可以考虑简单的对数组长度取余，最好是对质数取余；

参考代码

HashSet

这里的哈希函数，我在直接取余之前还做了一点混淆，用于让数据分散更均匀。

/*
 * @lc app=leetcode id=705 lang=cpp
 *
 * [705] Design HashSet
 */

// @lc code=start
class MyHashSet {
    constexpr static int nSlot = 521; // 997;
    vector<vector<int>> data;
    int hashFunc(int key) {
        return ((key >> 16) ^ (key * 31)) % nSlot;
    }
public:
    /** Initialize your data structure here. */
    MyHashSet() : data(nSlot) {        
    }
    
    void add(int key) {
        int index = hashFunc(key);
        for (int i=0; i<data[index].size(); i++) {
            if (data[index][i] == key) {
                return;
            }
        }
        data[index].push_back(key);
    }
    
    void remove(int key) {
        int index = hashFunc(key);
        for (int i=0; i<data[index].size(); i++) {
            if (data[index][i] == key) {
                data[index][i] = data[index].back();
                data[index].pop_back();
                return;
            }
        }
    }
    
    /** Returns true if this set contains the specified element */
    bool contains(int key) {
        int index = hashFunc(key);
        for (int i=0; i<data[index].size(); i++) {
            if (data[index][i] == key) {
                return true;
            }
        }
        return false;
    }
}; // AC

/**
 * Your MyHashSet object will be instantiated and called as such:
 * MyHashSet* obj = new MyHashSet();
 * obj->add(key);
 * obj->remove(key);
 * bool param_3 = obj->contains(key);
 */
// @lc code=end

HashMap

/*
 * @lc app=leetcode id=706 lang=cpp
 *
 * [706] Design HashMap
 */

// @lc code=start
class MyHashMap {
    // optimization TBD:
    // 1. scaling slots;
    // 1. binary-search for KV-pair in the same slot
    // 1. better hash function
    constexpr static int nSlot = 521; // 997;
    vector<vector<pair<int,int>>> data;
    int hashFunc(int key) {
        return ((key >> 16) ^ (key * 31)) % nSlot;
    }
public:
    /** Initialize your data structure here. */
    MyHashMap() : data(nSlot) {
    }
    
    /** value will always be non-negative. */
    void put(int key, int value) {
        int index = hashFunc(key);
        for (int i=0; i<data[index].size(); i++) {
            if (data[index][i].first == key) {
                data[index][i].second = value;
                return;
            }
        }
        data[index].push_back({key, value});
        return;
    }
    
    /** Returns the value to which the specified key is mapped, or -1 if this map contains no mapping for the key */
    int get(int key) {
        int index = hashFunc(key);
        for (int i=0; i<data[index].size(); i++) {
            if (data[index][i].first == key) {
                return data[index][i].second;
            }
        }
        return -1;
    }
    
    /** Removes the mapping of the specified value key if this map contains a mapping for the key */
    void remove(int key) {
        int index = hashFunc(key);
        for (int i=0; i<data[index].size(); i++) {
            if (data[index][i].first == key) {
                data[index][i] = data[index].back();
                data[index].pop_back();
                return;
            }
        }
    }
}; // AC

/**
 * Your MyHashMap object will be instantiated and called as such:
 * MyHashMap* obj = new MyHashMap();
 * obj->put(key,value);
 * int param_2 = obj->get(key);
 * obj->remove(key);
 */
// @lc code=end

拓展延伸

实际上我们的设计距离实用的哈希表（如Java的HashMap和C++的unordered_map）还有很大差距，表现在：

类型限定。本题只要求了简单的int类型，对于更多的类型比如string类型怎么计算hash值，自定义类型如何传入哈希函数。
容量伸缩。初始的容量可以小一些，减少内存占用；当数组的占用率超过某一阈值的时候启动扩容机制，并进行 re-hash。
对于哈希值相同的元素，使用链表组织；当链表长度超过某一阈值的时候，切换为使用红黑树组织。
使用分散度更好的hash函数。

以及，进一步的，考虑多线程并发访问的问题，是否需要加锁（如 ConcurrentHashMap），以及锁的类型、粒度设计。这些都可以作为程序员的“本钱”进行积累。