A String Hash Function Example

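Recently I needed to design a distributed storage system for historical data. While designing the historical data table, I found that the only unique identifier for an object in our system seemed to be its name. There is also a unique ObjectID that could serve as an identifier, but the objects do not store their own ID. So what to do? Using a string as the primary key of a table is obviously inefficient. One idea came to mind: is there an algorithm that can convert a string into a unique integer? What we want is a completely collision-free hash function, that is, for any key1 != key2 we have h(key1) != h(key2).

Being completely collision-free is quite hard. For arbitrary strings it is in fact impossible with a fixed-width integer, because there are more possible strings than integer values, so the realistic goal is no collisions among the names we actually store. I searched around and found a well-written blog post: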

http://www.cnblogs.com/uvsjoh/archive/2012/03/27/2420120.html

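In that post, the author analyzes the current state of string hash functions in depth and gives implementations of several of them. So I decided to try BKDRHash, the "highest scoring" algorithm in the post: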

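// BKDR Hash Function
unsigned int BKDRHash(const char *str)
{
    unsigned int seed = 131; // 31 131 1313 13131 131313 etc..
    unsigned int hash = 0;
    while (*str)
    {
        // Multiply-and-add per character; unsigned overflow wraps around
        hash = hash * seed + (*str++);
    }
    // Clear the top bit so the result also fits in a signed 32-bit integer
    return (hash & 0x7FFFFFFF);
}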
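
To make the recurrence concrete, here is a minimal sketch that restates the hash so it compiles on its own and adds a tracing helper. The names `bkdr_hash` and `bkdr_trace` are my own for illustration, and the cast to `unsigned char` is an assumption I added so that bytes above 127 behave the same on every platform.

```c
#include <stdint.h>
#include <stdio.h>

/* Same recurrence as BKDRHash: hash = hash * 131 + next byte, modulo 2^32. */
uint32_t bkdr_hash(const char *str)
{
    const uint32_t seed = 131;
    uint32_t hash = 0;
    while (*str) {
        /* Assumption: treat each char as an unsigned byte. */
        hash = hash * seed + (unsigned char)(*str++);
    }
    return hash & 0x7FFFFFFF; /* keep the low 31 bits */
}

/* Hypothetical helper: prints the running hash after each character. */
void bkdr_trace(const char *str)
{
    const uint32_t seed = 131;
    uint32_t hash = 0;
    while (*str) {
        hash = hash * seed + (unsigned char)(*str);
        printf("after '%c': %u\n", *str, (unsigned)hash);
        str++;
    }
}
```

For the two-character string "ab", the running value is 97 after 'a' and 97 * 131 + 98 = 12805 after 'b', so `bkdr_hash("ab")` returns 12805. In other words, the hash is the string read as a number in base 131, reduced modulo 2^32.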

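In our system, the number of points that need historical storage is not large, only about 25K. At that scale I expected that no collisions would occur. To check this guess, and to see how the hash function behaves on our real data, I designed a test: export every object name from the database to a text file, hash each name, and log any name whose hash value has already been seen.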
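
A rough way to gauge the risk before running anything is the birthday problem. The sketch below assumes, purely for illustration, that the hash behaves like a uniformly random function onto `m` possible values. Real names hashed with BKDRHash may do better or worse than this idealized model. The function name `collision_probability` is my own.

```c
/* Probability that at least two of n keys share a hash value when each key
   is mapped independently and uniformly at random to one of m values. */
double collision_probability(unsigned long n, double m)
{
    double p_no_collision = 1.0;
    for (unsigned long i = 0; i < n; i++) {
        /* The (i+1)-th key must avoid the i values already taken. */
        p_no_collision *= 1.0 - (double)i / m;
    }
    return 1.0 - p_no_collision;
}
```

With n = 25000 and m = 2^31 (the hash keeps 31 bits), this gives roughly 0.135. Under the random model there is therefore about a 13.5% chance of at least one collision, which is why checking against the real names is worth the effort.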

The full test program is as follows:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Data.SqlClient;

namespace StringHash
{
    class Program
    {
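        // BKDR hash. strInfo[i] is a UTF-16 code unit, so the result matches
        // the byte-based C version only for ASCII names.
        static UInt32 GetHashCode(String strInfo)
        {
            UInt32 nSeed = 131;     // 31 131 1313 13131 131313 etc...
            UInt32 nHashCode = 0;
            for (int i = 0; i < strInfo.Length; i++)
            {
                // unchecked: overflow must wrap around even if the project
                // is compiled with overflow checking enabled
                nHashCode = unchecked(nHashCode * nSeed + strInfo[i]);
            }

            // Clear the top bit so the value also fits in a signed 32-bit integer
            return (nHashCode & 0x7FFFFFFF);
        }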
        static void Main(string[] args)
        {
            string strFileName = @"F:\Names.txt";

            // Fetch all telemetry object names from the database and write them to a file.
            // CreateText overwrites the file, so rerunning the test does not append duplicates.
            string strConnect = "Server=GR;User Id=sa;Password=scbj;Database=NNGR;Connection Timeout=5;";
            string strSQL = "SELECT tag_name FROM [NNGR].[dbo].[遥测参数表] order by gobject_id ";
            using (StreamWriter swInput = File.CreateText(strFileName))
            using (SqlConnection sqlCon = new SqlConnection(strConnect))
            {
                sqlCon.Open();
                using (SqlCommand sqlCmd = new SqlCommand(strSQL, sqlCon))
                using (SqlDataReader sqlReader = sqlCmd.ExecuteReader())
                {
                    while (sqlReader.Read())
                    {
                        swInput.WriteLine(String.Format("{0}", sqlReader[0]));
                    }
                }
            }
            Console.WriteLine("Write done!");

            // Read the names back, convert each to a hash code, and check for collisions
            string[] strNameList = File.ReadAllLines(strFileName, Encoding.UTF8);
            Dictionary<UInt32, String> dicNames = new Dictionary<UInt32, String>();

            // CreateText creates the file if it is missing and truncates it otherwise,
            // so each run starts with an empty collision report.
            string strFileName2 = @"F:\Conflicts.txt";
            StreamWriter swOutput = File.CreateText(strFileName2);

            foreach (string strName in strNameList)
            {
                UInt32 nHashCode = GetHashCode(strName);
                string strExisting;
                if (!dicNames.TryGetValue(nHashCode, out strExisting))
                {
                    dicNames.Add(nHashCode, strName);
                }
                else if (strExisting != strName)
                {
                    // Two different names share one hash value: a real collision
                    swOutput.WriteLine("String [{0}] conflicts with: [{1}]<{2}>",
                        strName, strExisting, nHashCode);
                }
            }
            swOutput.Close();

            Console.WriteLine("All is done!");
            Console.ReadKey();
        }
    }
}
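
The same check can be sketched in C without a database or a dictionary: hash every name, sort the hash values, and count equal neighbors. This is only an illustrative variant, not the program used above. The names `bkdr_hash`, `compare_u32`, and `count_hash_collisions` are my own, and the sketch assumes the input names are unique, since a repeated name would also be counted.

```c
#include <stdint.h>
#include <stdlib.h>

/* BKDR hash restated so the sketch is self-contained (bytes taken as unsigned). */
uint32_t bkdr_hash(const char *str)
{
    uint32_t hash = 0;
    while (*str) {
        hash = hash * 131u + (unsigned char)(*str++);
    }
    return hash & 0x7FFFFFFF;
}

/* qsort comparator for uint32_t values, written to avoid overflow. */
int compare_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Returns the number of equal adjacent pairs among the sorted hash values,
   or -1 if memory could not be allocated. */
int count_hash_collisions(const char *const *names, size_t count)
{
    if (count == 0) {
        return 0;
    }
    uint32_t *hashes = malloc(count * sizeof *hashes);
    if (hashes == NULL) {
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        hashes[i] = bkdr_hash(names[i]);
    }
    qsort(hashes, count, sizeof *hashes, compare_u32);

    int collisions = 0;
    for (size_t i = 1; i < count; i++) {
        if (hashes[i] == hashes[i - 1]) {
            collisions++;
        }
    }
    free(hashes);
    return collisions;
}
```

Sorting needs only an array of integers, at the cost of not reporting which names collided. The dictionary approach in the C# program keeps the names and can therefore print both sides of each collision.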

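In the actual test, the algorithm indeed produced no collisions among our names. Of course, the data set I tested is very small, but that is all our real requirement amounts to, and the result says nothing about larger or different sets of names. For a detailed comparison of the algorithms, see the blog post referenced above.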