List和hashtable之查找效率

假设这样的一个场景，有两个List<string>,分别是listA和listB，其中均有超过1w+的数据量，现在要求找出在B中但是不在A中的数据集合，

于是我们写了以下代码

View Code

listB.ForEach(re =>
             {
                 if (!listA.Contains(re))
                 {
                     //.......
                 }
             });

这样写的话，是能够达到我们的需求的，但是很不幸，时间也会消耗很多。

我们看下List.Contains的源代码实现

View Code

[__DynamicallyInvokable]
 public bool Contains(T item)
 {
     if (item == null)
     {
         for (int j = 0; j < this._size; j++)
         {
             if (this._items[j] == null)
             {
                 return true;
             }
         }
         return false;
     }
     EqualityComparer<T> comparer = EqualityComparer<T>.Default;
     for (int i = 0; i < this._size; i++)
     {
         if (comparer.Equals(this._items[i], item))
         {
             return true;
         }
     }
     return false;
 }

这是因为List.Contains 本质是通过循环查找出该条数据，每一次的调用都会重头循环，所以效率很低。显然，这是不可取的。

于是，我们可以利用Hashtable的键值对改善这种情况。

代码如下：

View Code

listA.ForEach(re =>
             {
                 table.Add(re, re);
             });
             listB.ForEach(re =>
                 {
                     if (table.ContainsKey(re))
                     {
                         //......
                     }
                 });

虽然，多了一个循环，但是总时间确实减少了很多。

我们可以看下Hashtable.ContainsKey源代码：

View Code

public virtual bool ContainsKey(object key)
 {
     uint num;
     uint num2;
     Hashtable.bucket bucket;
     if (key == null)
     {
         throw new ArgumentNullException("key", Environment.GetResourceString("ArgumentNull_Key"));
     }
     Hashtable.bucket[] buckets = this.buckets;
     uint num3 = this.InitHash(key, buckets.Length, out num, out num2);
     int num4 = 0;
     int index = (int) (num % buckets.Length);
     do
     {
         bucket = buckets[index];
         if (bucket.key == null)
         {
             return false;
         }
         if (((bucket.hash_coll & 0x7fffffff) == num3) && this.KeyEquals(bucket.key, key))
         {
             return true;
         }
         index = (int) ((index + num2) % ((ulong) buckets.Length));
     }
     while ((bucket.hash_coll < 0) && (++num4 < buckets.Length));
     return false;
 }

可见，哈希表的数据查找效率不是盖出来的。

我们可以通过下面的代码验证：

View Code

static void Main(string[] args)
         {
             List<string> listA = new List<string>();
             List<string> listB = new List<string>();
             for (int i = 0; i < 9999; i++)
             {
                 listA.Add(i.ToString());
                 listB.Add((i + 1).ToString());
             }
             Stopwatch watch = new Stopwatch();
             watch.Start();
             listB.ForEach(re =>
             {
                 if (!listA.Contains(re))
                 {
                     //.......
                 }
             });
             watch.Stop();
             Console.WriteLine("List.Contains need {0}", watch.ElapsedTicks);
             Hashtable table = new Hashtable();
             watch.Restart();
             listA.ForEach(re =>
             {
                 table.Add(re, re);
             });
             listB.ForEach(re =>
                 {
                     if (table.ContainsKey(re))
                     {
                         //......
                     }
                 });
             watch.Stop();
             Console.WriteLine("Hashtable.Contains need {0}", watch.ElapsedTicks);
             Console.ReadKey(true);
         }

执行的结果如下：

List.Contains need 1920001 Hashtable.Contains need 8493 可以看到，效率提升了很多，当数据量越大，效果越明显。

注：本帖只是为了说明可以使用Hashtable改善List的查找性能，不做其他用处