Solving Problems with Hash Tables: Lookup in Large Data Sets

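A few hash table concepts:

Mapping (image): a hash function maps each key to a position in the table; the hash table built this way is the image of the key set under the hash function.

Collision: when two different keys have the same hash function value, the situation is called a collision.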

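Several ways to handle collisions:

1. Open addressing: when a collision occurs, an address sequence is generated and probed one address at a time until an "empty" open address is found; the colliding key is then stored at that address.

For example: hash(i) = (hash(key) + d(i)) MOD m (i = 1, 2, 3, ..., k (k < m-1)), where d is the increment function and d(i) takes the values d1, d2, d3, ..., dn-1.

Different choices of increment sequence give different open-addressing probing methods.

These include linear probing, quadratic probing, and pseudo-random probing.

Quadratic probing: h+1, h+4, h+9, ..., h+i^2, ...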
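As a rough illustration of how the increment sequence determines the probing method, the sketch below computes the i-th probe address for linear and quadratic probing. The function names `linear_probe` and `quadratic_probe` are made up for this example; `home` stands for hash(key) and `m` for the table length.

```cpp
#include <cassert>

// i-th probe address under linear probing: increment d(i) = i.
int linear_probe(int home, int i, int m) {
    assert(m > 0 && home >= 0 && i >= 0);
    return static_cast<int>((home + static_cast<long long>(i)) % m);
}

// i-th probe address under quadratic probing: increment d(i) = i * i.
int quadratic_probe(int home, int i, int m) {
    assert(m > 0 && home >= 0 && i >= 0);
    return static_cast<int>((home + static_cast<long long>(i) * i) % m);
}
```

With m = 11 and home = 3, `quadratic_probe` visits 4, 7, 1, 8, 6 for i = 1 to 5, while `linear_probe` visits 4, 5, 6, 7, 8.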

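The probe count (probe_count) tells whether every distinct position has been visited: while probe_count < (hash_size + 1)/2, there are still distinct positions left to visit; once probe_count >= (hash_size + 1)/2, all distinct positions have been visited (when hash_size is odd, i^2 and (hash_size - i)^2 leave the same remainder modulo hash_size, so later probes only repeat earlier ones). A probing loop can exit at that point: for an insertion the value cannot be inserted, and for a lookup the value is not in the table.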
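The bound can be checked by brute force. The sketch below (the helper name `distinct_quadratic_offsets` is an assumption for this example) counts how many different offsets i^2 mod m occur; for an odd prime m the count is (m + 1)/2, which is where the loop limit comes from.

```cpp
#include <cassert>
#include <set>

// Count the distinct values of i*i mod m for i = 0 .. m-1.
int distinct_quadratic_offsets(int m) {
    assert(m > 0);
    std::set<int> seen;
    for (long long i = 0; i < m; i++) {
        seen.insert(static_cast<int>((i * i) % m));
    }
    return static_cast<int>(seen.size());
}
```

For m = 11 the offsets are {0, 1, 3, 4, 5, 9}, six values, matching (11 + 1)/2.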

A concrete example:

1023. Simple Hash 2

Time Limit: 1sec Memory Limit:256MB

Description

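A hash table is a data structure that is accessed directly by key value. It maps a key to a position in the table and accesses the record there, which speeds up lookups. When keys are mapped, different keys will inevitably land on the same position at times, and a collision occurs. Quadratic probing can resolve collisions in a hash table. The basic idea: let the hash function be h(key) = d and assume the storage is a circular array; when a collision occurs, the positions to probe next are h+1, h+4, h+9, ..., h+i^2, ... until the collision is resolved.

For example, given the key set {47, 7, 29, 11, 16, 92, 22, 8, 3},

let the table length be m = 11 and the hash function be Hash(key) = key mod 11, with quadratic probing used to handle collisions. The resulting hash table is: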

Index:  0   1   2   3   4   5   6   7   8   9   10
Key:    11  22      47  92  16  3   7   29  8
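The table above can be reproduced with a few lines of code. This is only a sketch under assumptions: keys are non-negative, -1 marks an empty slot, and the name `build_table` is made up for the example.

```cpp
#include <cassert>
#include <vector>

const int EMPTY_SLOT = -1;  // marker for an unused slot (assumes non-negative keys)

// Insert keys with Hash(key) = key mod m, resolving collisions by quadratic probing.
std::vector<int> build_table(const std::vector<int>& keys, int m) {
    assert(m > 0);
    std::vector<int> table(m, EMPTY_SLOT);
    for (int key : keys) {
        int home = key % m;
        // Offsets i*i repeat once i reaches (m + 1) / 2, so stop there.
        for (int i = 0; i < (m + 1) / 2; i++) {
            int pos = (home + i * i) % m;
            if (table[pos] == EMPTY_SLOT) {
                table[pos] = key;
                break;
            }
        }
    }
    return table;
}
```

Key 3 is the longest case: its home slot 3 is taken, and it probes 4, 7, 1, and 8 before landing in slot 6.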

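Now the hash function is given as Hash(key) = key mod m. Build the hash table according to the rules above, using quadratic probing to handle collisions, and process the following operations.

Add a: add a (|a| <= 1000000000) to the hash table.

Query a: check whether a is in the hash table.

Print: print the current state of the hash table.

End: end the operations.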

Input

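The first line of input is an integer m (1 < m <= 1000), the size of the array used by the hash table and also the modulus of the hash function (see the description).

It is followed by several lines of operations (as described above); the program ends when the input is End.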

Output

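For each Query a operation, output yes if a is in the hash table and no otherwise. For each Print operation, print the current state of the hash table in the format idx#key, where idx is the array index and key is the key value; if a position holds no key, output NULL. Each element takes one line. For the hash table described above, the Print result is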

0#11

1#22

2#NULL

3#47

4#92

5#16

6#3

7#7

8#29

9#8

10#NULL

Sample Input


5
Add 1
Add 5
Add 6
Query 1
Query 7
Print
End

Sample Output

yes
no
0#5
1#1
2#6
3#NULL
4#NULL

Problem Source: 2012 final on-machine exam (Pan)

Problem link: http://soj.me/show_problem.php?pid=1023&cid=1092

// Problem#: 8671
// Submission#: 2493434
// The source code is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License
// URI: http://creativecommons.org/licenses/by-nc-sa/3.0/
// All Copyright reserved by Informatic Lab of Sun Yat-sen University
#include <iostream>
#include <stdio.h>
#include <string>
using namespace std;
struct Node {
    int element;
    bool flag;  // true once the slot holds a key
    Node() : element(-1), flag(false) {}
};
int mod;
string command;    // renamed from select, which can clash with POSIX select()
Node table[1001];  // renamed from hash, which can be ambiguous with std::hash
int num;
// Home position of a key; keeps negative keys inside [0, mod).
int Home(int key) {
    return ((key % mod) + mod) % mod;
}
void Add() {
    scanf("%d", &num);
    int location = Home(num);
    // Probe location + i*i; i >= (mod + 1)/2 means the key cannot be inserted.
    for (int i = 0; i < (mod + 1) / 2; i++) {
        int location_ = (location + i * i) % mod;
        if (!table[location_].flag) {
            table[location_].flag = true;
            table[location_].element = num;
            return;
        }
    }
}
void Query() {
    scanf("%d", &num);
    int location = Home(num);
    // Same probe sequence as Add; i >= (mod + 1)/2 means the key is not present.
    for (int i = 0; i < (mod + 1) / 2; i++) {
        int location_ = (location + i * i) % mod;
        // Check flag as well: an empty slot stores -1, which is also a legal key.
        if (table[location_].flag && table[location_].element == num) {
            printf("yes\n");
            return;
        }
    }
    printf("no\n");
}
void Print() {
    for (int i = 0; i < mod; i++) {
        if (table[i].flag)
            printf("%d#%d\n", i, table[i].element);
        else
            printf("%d#NULL\n", i);
    }
}
int main() {
    scanf("%d", &mod);
    while (cin >> command) {
        if (command == "Add") {
            Add();
        } else if (command == "Query") {
            Query();
        } else if (command == "Print") {
            Print();
        } else if (command == "End") {
            break;
        }
    }
    return 0;
}

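2. Separate chaining: all records whose keys are synonyms are stored in one linear linked list, called a synonym list; that is, keys with the same hash address are stored in the same synonym list.

Resolving collisions by chaining:
Chaining resolves collisions by linking all nodes whose keys are synonyms into the same singly linked list. If the chosen table length is m, the hash table can be defined as an array of m head pointers T[0..m-1] (for example, Node** T;). Every node whose hash address is i is inserted into the singly linked list whose head pointer is T[i]. Every entry of T should initially be a null pointer. With chaining the load factor α can be greater than 1, but α ≤ 1 is the usual choice.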
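A minimal sketch of this layout for integer keys follows. The names `ChainNode`, `ChainedTable`, `make_table`, `chain_insert`, and `chain_contains` are assumptions for illustration, the hash is simply key mod m, and freeing the nodes is left out for brevity.

```cpp
#include <cassert>

struct ChainNode {
    int key;
    ChainNode* next;
};

struct ChainedTable {
    int m;          // number of buckets
    ChainNode** T;  // T[0..m-1], each entry the head pointer of one chain
};

ChainedTable make_table(int m) {
    assert(m > 0);
    ChainedTable t;
    t.m = m;
    t.T = new ChainNode*[m]();  // value-initialised: every head starts as a null pointer
    return t;
}

// Hash address of a key; the double modulo keeps negative keys in [0, m).
int bucket_of(const ChainedTable& t, int key) {
    return ((key % t.m) + t.m) % t.m;
}

// Link a new node at the head of the chain for the key's hash address.
void chain_insert(ChainedTable& t, int key) {
    int i = bucket_of(t, key);
    ChainNode* node = new ChainNode;
    node->key = key;
    node->next = t.T[i];
    t.T[i] = node;
}

// Walk only the chain of synonyms for this key.
bool chain_contains(const ChainedTable& t, int key) {
    for (ChainNode* p = t.T[bucket_of(t, key)]; p != nullptr; p = p->next) {
        if (p->key == key) return true;
    }
    return false;
}
```

Inserting three keys into a table made with `make_table(2)` still works, which is the sense in which the load factor α may exceed 1.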

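Advantages of chaining:

① Chaining handles collisions simply and has no clustering: keys that are not synonyms never collide, so the average search length is shorter.
② The nodes of each list are allocated dynamically, so chaining suits cases where the table length cannot be determined before the table is built.
③ To reduce collisions, open addressing needs a small load factor α, which wastes a lot of space when the nodes are large. With chaining α ≥ 1 is acceptable, and when the nodes are large the added pointer field is negligible, so space is saved.
④ Deleting a node from a chained hash table is easy: just remove the node from its list. In a table built with open addressing, a deleted node's slot cannot simply be set to empty, because that would cut off the search path of synonym nodes inserted after it. The reason is that in every open-addressing scheme an empty slot (an open address) is the condition for an unsuccessful search. So deletion in an open-addressing table can only place a deletion mark on the node; the node cannot really be removed.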
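The deletion mark from point ④ can be sketched as a third slot state. The code below assumes linear probing and integer keys; `ProbeTable`, `SlotState`, and the method names are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <vector>

enum class SlotState { Empty, Occupied, Deleted };

struct Slot {
    int key;
    SlotState state;
};

// Open addressing with linear probing and deletion marks. m must be positive.
struct ProbeTable {
    std::vector<Slot> slots;

    explicit ProbeTable(int m) : slots(m, Slot{0, SlotState::Empty}) {}

    int size() const { return static_cast<int>(slots.size()); }

    int home(int key) const {
        assert(!slots.empty());
        int m = size();
        return ((key % m) + m) % m;
    }

    // Returns the index holding key, or -1 if the key is absent.
    int find(int key) const {
        int m = size();
        int h = home(key);
        for (int i = 0; i < m; i++) {
            int pos = (h + i) % m;
            const Slot& s = slots[pos];
            if (s.state == SlotState::Empty) return -1;  // a truly empty slot ends the search
            if (s.state == SlotState::Occupied && s.key == key) return pos;
            // Deleted: keep probing, later synonyms may lie beyond the mark.
        }
        return -1;
    }

    bool insert(int key) {
        if (find(key) != -1) return true;  // already present
        int m = size();
        int h = home(key);
        for (int i = 0; i < m; i++) {
            Slot& s = slots[(h + i) % m];
            if (s.state != SlotState::Occupied) {  // Empty or Deleted slots can be reused
                s.key = key;
                s.state = SlotState::Occupied;
                return true;
            }
        }
        return false;  // table is full
    }

    bool erase(int key) {
        int pos = find(key);
        if (pos == -1) return false;
        slots[pos].state = SlotState::Deleted;  // mark only; do not reset to Empty
        return true;
    }
};
```

In a table of size 5, keys 1 and 6 both hash to slot 1, so 6 is stored in slot 2. After erasing 1, a lookup of 6 still succeeds because the search steps over the mark in slot 1 instead of stopping there.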

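Disadvantage of chaining:
The pointers need extra space, so when the nodes are small, open addressing is more space-efficient. If the space saved on pointers is used to enlarge the table, the load factor drops, which reduces collisions in open addressing and so improves the average search speed.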

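3. Rehashing: when a collision occurs, another hash function is computed, and so on until no collision remains; the extra hash computations make this method costly in time.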
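A small sketch of the idea, under stated assumptions: the two hash functions `rh1` and `rh2` are arbitrary choices made for this example, keys are non-negative, and -1 marks an empty slot. A practical table would need more functions or a fallback strategy.

```cpp
#include <cassert>
#include <vector>

// Two illustrative hash functions (hypothetical choices for this sketch).
int rh1(int key, int m) { return key % m; }
int rh2(int key, int m) { return (key / m) % m; }

// Try rh1 first, then rh2. Returns the slot used, or -1 if both collide.
int rehash_insert(std::vector<int>& table, int key) {
    int m = static_cast<int>(table.size());
    assert(m > 0 && key >= 0);
    int (*functions[])(int, int) = { rh1, rh2 };
    for (auto rh : functions) {
        int pos = rh(key, m);
        if (table[pos] == -1) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

Each collision costs one more hash computation, which is the time overhead mentioned above.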

Commonly used string hash functions:

// SDBM Hash Function
unsigned int SDBMHash(char *str)
{
    unsigned int hash = 0;

    while (*str)
    {
        // equivalent to: hash = 65599*hash + (*str++);
        hash = (*str++) + (hash << 6) + (hash << 16) - hash;
    }

    return (hash & 0x7FFFFFFF);
}

// RS Hash Function
unsigned int RSHash(char *str)
{
    unsigned int b = 378551;
    unsigned int a = 63689;
    unsigned int hash = 0;

    while (*str)
    {
        hash = hash * a + (*str++);
        a *= b;
    }

    return (hash & 0x7FFFFFFF);
}

// JS Hash Function
unsigned int JSHash(char *str)
{
    unsigned int hash = 1315423911;

    while (*str)
    {
        hash ^= ((hash << 5) + (*str++) + (hash >> 2));
    }

    return (hash & 0x7FFFFFFF);
}

// P. J. Weinberger Hash Function
unsigned int PJWHash(char *str)
{
    unsigned int BitsInUnignedInt = (unsigned int)(sizeof(unsigned int) * 8);
    unsigned int ThreeQuarters    = (unsigned int)((BitsInUnignedInt  * 3) / 4);
    unsigned int OneEighth        = (unsigned int)(BitsInUnignedInt / 8);
    unsigned int HighBits         = (unsigned int)(0xFFFFFFFF) << (BitsInUnignedInt - OneEighth);
    unsigned int hash             = 0;
    unsigned int test             = 0;

    while (*str)
    {
        hash = (hash << OneEighth) + (*str++);
        if ((test = hash & HighBits) != 0)
        {
            hash = ((hash ^ (test >> ThreeQuarters)) & (~HighBits));
        }
    }

    return (hash & 0x7FFFFFFF);
}

// ELF Hash Function
unsigned int ELFHash(char *str)
{
    unsigned int hash = 0;
    unsigned int x    = 0;

    while (*str)
    {
        hash = (hash << 4) + (*str++);
        if ((x = hash & 0xF0000000L) != 0)
        {
            hash ^= (x >> 24);
            hash &= ~x;
        }
    }

    return (hash & 0x7FFFFFFF);
}

// BKDR Hash Function
unsigned int BKDRHash(char *str)
{
    unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
    unsigned int hash = 0;

    while (*str)
    {
        hash = hash * seed + (*str++);
    }

    return (hash & 0x7FFFFFFF);
}

// DJB Hash Function
unsigned int DJBHash(char *str)
{
    unsigned int hash = 5381;

    while (*str)
    {
        hash += (hash << 5) + (*str++);
    }

    return (hash & 0x7FFFFFFF);
}

// AP Hash Function
unsigned int APHash(char *str)
{
    unsigned int hash = 0;
    int i;

    for (i=0; *str; i++)
    {
        if ((i & 1) == 0)
        {
            hash ^= ((hash << 7) ^ (*str++) ^ (hash >> 3));
        }
        else
        {
            hash ^= (~((hash << 11) ^ (*str++) ^ (hash >> 5)));
        }
    }

    return (hash & 0x7FFFFFFF);
}
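Each of the functions above returns a 31-bit value, so a table still has to reduce it to a bucket index. The sketch below repeats the BKDR scheme under the hypothetical name `BKDRHashSketch` so that it compiles on its own; `bucket_index` is also a made-up helper. It reads characters as unsigned char, which matches the original for ASCII input.

```cpp
#include <cassert>

// Same BKDR scheme as above, repeated so this sketch is self-contained.
unsigned int BKDRHashSketch(const char *str) {
    unsigned int seed = 131;
    unsigned int hash = 0;
    while (*str) {
        hash = hash * seed + static_cast<unsigned char>(*str++);
    }
    return hash & 0x7FFFFFFF;
}

// Reduce the hash to a bucket index in [0, bucket_count).
unsigned int bucket_index(const char *str, unsigned int bucket_count) {
    assert(bucket_count > 0);
    return BKDRHashSketch(str) % bucket_count;
}
```

For the string "ab" the hash is 97 * 131 + 98 = 12805, and `bucket_index("ab", 101)` is 12805 mod 101.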

http://198.74.100.235/wateroj/web/problem.php?id=1000

1000: Malicious IPs

Time Limit: 1 Sec Memory Limit: 16 MB

Description

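Water has finally learned how to build a personal website with Tornado and has put it online.

Visitors are gradually increasing, but Water has noticed that some malicious users love flooding the comments with replies such as "OP, don't give up on treatment!" and "OP, don't stop taking your meds!". Feeling the full malice of the world, Water is upset and has decided to filter these malicious users out.

He already has these users' IPs, but having failed his data structures course, he finds IP filtering rather hard, so he has come to you for help!

An IP has the form a.b.c.d, where a, b, c, and d are integers in [0, 255].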

Input

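There is only one data set. The first line is an integer N in [0, 1 000 000], the length of the malicious IP list. The next N lines are the N malicious IPs.
Then comes an integer M in [0, 1 000 000], the number of visiting IPs. The next M lines are the M visiting IPs.
For each visiting IP, decide whether it is in the malicious IP list.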

Output

If the visiting IP is malicious, output FILTER; otherwise output PASS.

Sample Input

5
233.233.233.233
250.250.250.250
10.20.30.40
123.255.123.255
172.18.182.69
6
10.123.128.245
233.233.233.233
172.18.182.253
102.30.40.50
172.18.182.96
172.18.182.69

Sample Output

PASS
FILTER
PASS
PASS
PASS
FILTER

HINT

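Rumor has it that scanf and printf are faster than cin?

The test platform is gcc on Linux; Linux users may want to look at arpa/inet.h.

Reluctant hints =-=:

1. For 255.255.255.255, consider a hash such as (255 << 24) + (255 << 16) + (255 << 8) + 255, and likewise for other addresses.

2. The number of buckets is usually a prime so that keys spread as evenly as possible; for this problem a prime around a few hundred thousand is suggested.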
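The two hints can be combined roughly as follows. This is a sketch, not the solution used below: `pack_ip` and `ip_bucket` are hypothetical names, unsigned int is assumed to be 32 bits wide, and unsigned arithmetic is used because 255 << 24 does not fit in a signed 32-bit int.

```cpp
#include <cassert>
#include <cstdio>

// Pack "a.b.c.d" into one 32-bit value: (a << 24) + (b << 16) + (c << 8) + d.
// Returns false if the text does not start with four numbers in [0, 255].
bool pack_ip(const char *text, unsigned int *out) {
    assert(text != nullptr && out != nullptr);
    unsigned int a, b, c, d;
    if (std::sscanf(text, "%u.%u.%u.%u", &a, &b, &c, &d) != 4) return false;
    if (a > 255 || b > 255 || c > 255 || d > 255) return false;
    *out = (a << 24) + (b << 16) + (c << 8) + d;
    return true;
}

// Bucket index for a packed IP; bucket_count is assumed to be a prime.
unsigned int ip_bucket(unsigned int packed, unsigned int bucket_count) {
    assert(bucket_count > 0);
    return packed % bucket_count;
}
```

Because every IP maps to a distinct 32-bit value, chains can store the packed integer and compare with == instead of keeping a string per node and calling strcmp.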

Code (implemented with chaining):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
const int MAXN = 1000003;  // number of buckets (a prime)
// SDBM string hash, reduced to a bucket index in [0, MAXN).
unsigned int SDBMHash(const char *str)
{
    unsigned int hash = 0;
    while (*str)
    {
        // equivalent to: hash = 65599*hash + (*str++);
        hash = (*str++) + (hash << 6) + (hash << 16) - hash;
    }
    return (hash & 0x7FFFFFFF) % MAXN;
}
struct Node {
    char* value;
    Node* next;
    Node() : value(NULL), next(NULL) {}
    Node(char* value_, Node* next_ = NULL) : value(value_), next(next_) {}
};
Node **arr = new Node*[MAXN];  // bucket head pointers; set to NULL in main
// Read n malicious IPs and add each one to the chain of its bucket.
void createHash(int n) {
    for (int i = 0; i < n; i++) {
        // "255.255.255.255" plus the terminator needs 16 bytes; 20 leaves some slack.
        char* value_ = (char*)malloc(20 * sizeof(char));
        scanf("%19s", value_);
        int location = SDBMHash(value_);
        // Insert at the head of the chain; the order inside a chain does not matter.
        arr[location] = new Node(value_, arr[location]);
    }
}
// Print FILTER if value is in the table, PASS otherwise.
void find_ip(const char* value) {
    for (Node* p = arr[SDBMHash(value)]; p != NULL; p = p->next) {
        if (strcmp(p->value, value) == 0) {
            printf("FILTER\n");
            return;
        }
    }
    printf("PASS\n");
}
int main() {
    char value[20];
    int evil_ips;
    int visits;
    for (int i = 0; i != MAXN; i++)
        arr[i] = NULL;
    scanf("%d", &evil_ips);
    createHash(evil_ips);
    scanf("%d", &visits);
    for (int i = 0; i < visits; i++) {
        scanf("%19s", value);
        find_ip(value);
    }
    return 0;
}