Implementing a Simple and Efficient URL (Text) Shortening Service

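One of my projects needed to shorten long URLs and push the results to customers over channels such as SMS and WeChat. At first I simply used a public shortening service found online. Then one weekend I got the itch to build a long-URL (text) shortening service of my own, with a few distinctive features.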

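I had written socket services before and still remembered how packets are packed and laid out, so my first instinct was to design the storage format. I did not use a database; instead, the service is built on two files: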

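Url.db stores the long URL text that users submit. Url.idx stores the index: for each submission it records the position (Begin) and length (Length) of the data, plus some extra fields (Hits, DateTime). Because adding a long URL only appends to both files, there is little I/O pressure even when the files grow large (several GB, say).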

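Now for the layout of Url.idx. ID is the primary key, an Int64, which occupies 8 bytes as a byte array. Next comes Begin, the length of Url.db just before the long URL is appended to it, also an Int64. A long URL has a bounded length, so an Int16 is enough for Length: Int16.MaxValue == 32767, which is well above the 4 KB cap commonly applied to URLs (the code further down rejects anything longer than 4096 characters), and an Int16 takes 2 bytes. Hits is the number of times the short URL has been resolved, an Int32 (4 bytes). DateTime is an Int64 (8 bytes). IDs do not auto-increment as they would in a database, so that has to be done by hand: before writing to Url.idx, read the ID from the last row (rows are notional; it is really just the last 30 bytes), increment it, and then write the new row.

In other words, each submitted long URL adds exactly 30 bytes to Url.idx, however long the data is (up to the Int16 limit of 32767 bytes).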
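To make the 30-byte layout concrete, here is a minimal sketch of packing one index record and reading fields back by offset. The class and method names (IndexRecordSketch, Pack, ReadId, ReadBegin, ReadLength) are hypothetical and not part of the original service; only the field order and sizes come from the description above.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class IndexRecordSketch
{
    // ID(8) + Begin(8) + Length(2) + Hits(4) + DateTime(8) = 30 bytes
    public const int RecordSize = 8 + 8 + 2 + 4 + 8;

    public static byte[] Pack(long id, long begin, short length, int hits, DateTime when)
    {
        byte[] record = new byte[RecordSize];
        BitConverter.GetBytes(id).CopyTo(record, 0);                // offset 0
        BitConverter.GetBytes(begin).CopyTo(record, 8);             // offset 8
        BitConverter.GetBytes(length).CopyTo(record, 16);           // offset 16
        BitConverter.GetBytes(hits).CopyTo(record, 18);             // offset 18
        BitConverter.GetBytes(when.ToBinary()).CopyTo(record, 22);  // offset 22
        return record;
    }

    public static long ReadId(byte[] record)
    {
        return BitConverter.ToInt64(record, 0);
    }

    public static long ReadBegin(byte[] record)
    {
        return BitConverter.ToInt64(record, 8);
    }

    public static short ReadLength(byte[] record)
    {
        return BitConverter.ToInt16(record, 16);
    }
}
```

Because every record has the same size, record number n always starts at byte n * 30, which is what makes seeking into the index cheap.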

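Once the data structure is settled, the rest of the service becomes straightforward. Suppose two long URLs are submitted in a row and the resulting short URLs are http://domain/1000 and http://domain/1001. That is plainly ugly: the ID after the domain is all digits, and the increment is obvious, so it is easy to enumerate every record by brute force. Decimal digits also carry little information per character: submit a million long URLs in one go and the short URLs grow longer and longer, which defeats the purpose.

So the next step is to rework the ID, with two goals:

1. Add obfuscation, so that two adjacent IDs show no obvious relationship.

2. Increase capacity, so that submitting a million long URLs at once does not noticeably change the ID length.

The simplest and most direct obfuscation is to convert from base 10 to base 62 (0-9a-zA-Z). Because a sequential alphabet such as abcdef... makes the next ID easy to guess, the 62 characters are shuffled once into a random order: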

    /// <summary>
    /// Generates a random ordering of the characters 0-9, A-Z, a-z.
    /// </summary>
    /// <returns>A 62-character string in which each character appears once.</returns>
    public static string GenerateKeys()
    {
        string[] Chars = "0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z".Split(',');
        int SeekSeek = unchecked((int)DateTime.Now.Ticks);
        Random SeekRand = new Random(SeekSeek);
        for (int i = 0; i < 100000; i++)
        {
            // Next(0, Length) covers every index, including the last one;
            // Next(1, Length) with r - 1 would never move the final character.
            int r = SeekRand.Next(0, Chars.Length);
            string f = Chars[0];
            Chars[0] = Chars[r];
            Chars[r] = f;
        }
        return string.Join("", Chars);
    }


Running the method above once produced this random sequence:

string Seq = "s9LFkgy5RovixI1aOf8UhdY3r4DMplQZJXPqebE0WSjBn7wVzmN2Gc6THCAKut";
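The conversion is only reversible if the alphabet contains each of the 62 characters exactly once, so a quick sanity check on a candidate sequence may be worthwhile. The sketch below is an assumption of mine, not part of the original code; AlphabetCheckSketch and IsValidAlphabet are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class AlphabetCheckSketch
{
    // True when seq has exactly 62 characters, all ASCII letters or digits, none repeated.
    public static bool IsValidAlphabet(string seq)
    {
        if (seq == null || seq.Length != 62) return false;
        var seen = new HashSet<char>();
        foreach (char c in seq)
        {
            bool isAlnum = (c >= '0' && c <= '9')
                        || (c >= 'a' && c <= 'z')
                        || (c >= 'A' && c <= 'Z');
            // HashSet.Add returns false when the character was already present.
            if (!isAlnum || !seen.Add(c)) return false;
        }
        return true;
    }
}
```

The sequence shown above passes this check; a string with any repeated or missing character would make two different IDs map to the same key.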

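Using this string in place of 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ obfuscates the output considerably: a decimal number converted to base 62 with this alphabet no longer looks anything like the original. The conversion methods: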

    /// <summary>
    /// Converts a decimal ID to base 62.
    /// </summary>
    /// <param name="id">A non-negative ID.</param>
    /// <returns>The base-62 string, using Seq as the digit alphabet.</returns>
    private static string Convert(long id)
    {
        if (id < 62)
        {
            return Seq[(int)id].ToString();
        }
        int y = (int)(id % 62);
        long x = id / 62;

        return Convert(x) + Seq[y];
    }

    /// <summary>
    /// Converts a base-62 string back to decimal.
    /// </summary>
    /// <param name="Num">A string made of characters from Seq.</param>
    /// <returns>The decimal value, or -1 if Num contains a character outside Seq.</returns>
    private static long Convert(string Num)
    {
        long v = 0;
        foreach (char c in Num)
        {
            int t = Seq.IndexOf(c);
            if (t < 0) return -1; // not a valid key character
            // Integer arithmetic; Math.Pow works in double and loses precision for large values.
            v = v * 62 + t;
        }
        return v;
    }

For example, Convert(123456789) returns RYswX and Convert(123456790) returns RYswP.
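To see where those characters come from, the following sketch exposes the intermediate base-62 digits before they are mapped through Seq. Base62DigitsSketch, Digits and Encode are hypothetical names introduced for illustration, and the sketch assumes a non-negative ID.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class Base62DigitsSketch
{
    const string Seq = "s9LFkgy5RovixI1aOf8UhdY3r4DMplQZJXPqebE0WSjBn7wVzmN2Gc6THCAKut";

    // Base-62 digits of a non-negative id, most significant first.
    public static int[] Digits(long id)
    {
        var digits = new List<int>();
        do
        {
            digits.Insert(0, (int)(id % 62));
            id /= 62;
        } while (id > 0);
        return digits.ToArray();
    }

    // Maps each digit to its character in the shuffled alphabet.
    public static string Encode(long id)
    {
        return new string(Digits(id).Select(d => Seq[d]).ToArray());
    }
}
```

Digits(123456789) yields 8, 22, 0, 46, 33, because 8*62^4 + 22*62^3 + 0*62^2 + 46*62 + 33 = 123456789. Looking those positions up in Seq gives R, Y, s, w, X. Adding 1 only changes the last digit from 33 to 34, which is why the neighbouring key differs only in its final character.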

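Someone who analyzes a large number of consecutive values can still work out the Seq alphabet by brute force and then guess the IDs on either side of a given one. To strengthen the obfuscation further, the ID is incremented not by a fixed 1 but by a random step, for example 1000, 1005, 1013, 1014, 1020, so the sequence has no predictable pattern.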

    private static Int16 GetRnd(Random seekRand)
    {
        // Random.Next(1, 11) returns a value from 1 to 10 inclusive.
        Int16 s = (Int16)seekRand.Next(1, 11);
        return s;
    }
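As a small illustration of what the random step produces, the sketch below generates a run of IDs from a starting value. RandomStepSketch and NextIds are hypothetical names; the step range of 1 to 10 matches GetRnd above.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class RandomStepSketch
{
    // Returns count IDs, each 1 to 10 larger than the one before it.
    public static long[] NextIds(long last, int count, Random rng)
    {
        long[] ids = new long[count];
        for (int i = 0; i < count; i++)
        {
            last += rng.Next(1, 11); // upper bound is exclusive, so the step is 1..10
            ids[i] = last;
        }
        return ids;
    }
}
```

The step is always at least 1, so the IDs stay strictly increasing. That ordering is the property the lookup by binary search relies on later.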

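Even if the base-62 value is reversed to the decimal ID, the neighboring IDs are hard to guess, which makes brute-force enumeration considerably harder. Still, two consecutively generated base-62 values such as RYswX and RYswP above differ only in the last character and look very similar. So a third, simple layer of obfuscation rotates the base-62 string left (or right) by some number of positions; resolving rotates it back by the same amount: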

    /// <summary>
    /// Obfuscates an ID into a string.
    /// </summary>
    /// <param name="id">A non-negative ID.</param>
    /// <returns>The base-62 key, rotated left.</returns>
    private static string Mixup(long id)
    {
        string Key = Convert(id);
        int s = 0;
        foreach (char c in Key)
        {
            s += (int)c;
        }
        int Len = Key.Length;
        int x = (s % Len); // rotation amount; the character sum does not change under rotation
        char[] arr = Key.ToCharArray();
        char[] newarr = new char[arr.Length];
        Array.Copy(arr, x, newarr, 0, Len - x); // rotate left by x
        Array.Copy(arr, 0, newarr, Len - x, x);
        return new string(newarr);
    }

    /// <summary>
    /// Reverses the obfuscation and returns the ID.
    /// </summary>
    /// <param name="Key">The key from the short URL.</param>
    /// <returns>The ID, or -1 if the key is empty or contains invalid characters.</returns>
    private static long UnMixup(string Key)
    {
        if (string.IsNullOrEmpty(Key)) return -1; // avoids a division by zero below
        int s = 0;
        foreach (char c in Key)
        {
            s += (int)c;
        }
        int Len = Key.Length;
        int x = (s % Len);
        x = Len - x; // rotating left by Len - x undoes a left rotation by x
        char[] arr = Key.ToCharArray();
        char[] newarr = new char[arr.Length];
        Array.Copy(arr, x, newarr, 0, Len - x);
        Array.Copy(arr, 0, newarr, Len - x, x);
        return Convert(new string(newarr));
    }

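Mixup(123456789) returns wXRYs. If the random step is 7, the next record's ID gives Mixup(123456796) = swWRY, and by eye it is hard to tell that the two IDs are adjacent.

That covers the data structure and the ID obfuscation. Next comes resolving a short URL.

Given a short URL key such as wXRYs, the UnMixup() method above recovers the ID. Because IDs do not increase in steps of 1, the record's position in the index file cannot be computed directly from the ID (as in ID * 30). But the IDs are stored in ascending order, so binary search is the natural way to locate one in the index file.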
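The rotation can be undone without storing the shift anywhere because the shift is derived from the sum of the character codes, and rotating a string does not change that sum. Here is a compact sketch of just that step, working on strings that are already base-62 encoded. RotateSketch, Mix and Unmix are hypothetical names, and the input is assumed to be non-empty.

```csharp
using System;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class RotateSketch
{
    // Shift derived from the character sum, which is the same before and after rotation.
    static int Shift(string s)
    {
        return s.Sum(c => (int)c) % s.Length;
    }

    public static string RotateLeft(string s, int n)
    {
        n %= s.Length; // a full turn is the same as no rotation
        return s.Substring(n) + s.Substring(0, n);
    }

    public static string Mix(string key)
    {
        return RotateLeft(key, Shift(key));
    }

    public static string Unmix(string mixed)
    {
        // Rotating left by (length - shift) cancels a left rotation by shift.
        return RotateLeft(mixed, mixed.Length - Shift(mixed));
    }
}
```

For RYswX the character codes sum to 493, and 493 % 5 = 3, so the key is rotated left by three positions to wXRYs. For RYswW the sum is 492, the shift is 2, and the result is swWRY.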

    // Core of the binary search
    FileStream Index = new FileStream(IndexFile, FileMode.OpenOrCreate, FileAccess.ReadWrite);
    byte[] buff = new byte[8];
    long Id = UnMixup(Key); // the real ID decoded from the short URL key
    long Left = 0;
    long Right = Index.Length / 30 - 1;
    long Middle = -1;
    bool Found = false; // Middle alone cannot signal a miss: it still holds the last probed record
    while (Left <= Right)
    {
        Middle = Left + (Right - Left) / 2;
        Index.Position = Middle * 30;
        Index.Read(buff, 0, 8);
        long val = BitConverter.ToInt64(buff, 0);
        if (val == Id)
        {
            Found = true; // Middle is the record number
            break;
        }
        if (val < Id)
        {
            Left = Middle + 1;
        }
        else
        {
            Right = Middle - 1;
        }
    }
    Index.Close();

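The core of the binary search is to keep moving the file pointer, read the 8 bytes at the middle record, convert them to a number, and compare with the target ID. It is very fast: with just under 4.3 billion records (2^32 = 4294967296), finding an ID takes at most 32 seeks, that is, 32 iterations of the while loop above.

Binary search is only needed because of the random step. With a fixed step of 1 it could be skipped: the record's offset in the index file would follow directly from the ID, as (ID minus the first ID) * 30.

The complete code follows, wrapped in a ShortenUrl class: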
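The probe count can be checked with a few lines of arithmetic: each probe halves the remaining range, so the worst case for n records is the number of times n can be halved before reaching zero. ProbeCountSketch and MaxProbes are hypothetical names used only for this illustration.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class ProbeCountSketch
{
    // Worst-case number of probes a binary search needs over recordCount sorted records.
    public static int MaxProbes(long recordCount)
    {
        int probes = 0;
        while (recordCount > 0)
        {
            probes++;
            recordCount /= 2; // each probe discards half of the remaining records
        }
        return probes;
    }
}
```

MaxProbes(4294967295) is 32, which matches the figure above for just under 2^32 records. One more record (exactly 2^32) raises the worst case to 33, and a million records need at most 20 probes.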

using System;
using System.Linq;
using System.Web;
using System.IO;
using System.Text;

/// <summary>
/// URL shortening service backed by two files, Url.db (data) and Url.idx (index).
/// </summary>
public class ShortenUrl
{
    const string Seq = "s9LFkgy5RovixI1aOf8UhdY3r4DMplQZJXPqebE0WSjBn7wVzmN2Gc6THCAKut";

    private static string DataFile
    {
        get { return HttpContext.Current.Server.MapPath("/Url.db"); }
    }

    private static string IndexFile
    {
        get { return HttpContext.Current.Server.MapPath("/Url.idx"); }
    }

    /// <summary>
    /// Adds a batch of URLs and returns their keys in the same order.
    /// If an input URL is invalid, the element at the same index of the result is null.
    /// </summary>
    /// <param name="Url">The long URLs to shorten (at most 1000 are processed).</param>
    /// <returns>One key per input URL, or null for rejected entries.</returns>
    public static string[] AddUrl(string[] Url)
    {
        string[] ResultKey = new string[Url.Length];
        DateTime Now = DateTime.Now;
        byte[] dt = BitConverter.GetBytes(Now.ToBinary());
        byte[] Hits = BitConverter.GetBytes(0); // new records start with zero hits (Int32, 4 bytes)
        Random seekRand = new Random(unchecked((int)Now.Ticks));
        string Host = HttpContext.Current.Request.Url.Host.ToLower();
        // The using blocks close both files even if an exception is thrown.
        // FileMode.Append positions the data stream at the end of the file.
        using (FileStream Index = new FileStream(IndexFile, FileMode.OpenOrCreate, FileAccess.ReadWrite))
        using (FileStream Data = new FileStream(DataFile, FileMode.Append, FileAccess.Write))
        {
            // index record: ID(8) + Begin(8) + Length(2) + Hits(4) + DateTime(8) = 30 bytes
            for (int i = 0; i < Url.Length && i < 1000; i++)
            {
                // Skip empty or oversized entries and URLs that point back at this host.
                if (string.IsNullOrEmpty(Url[i]) || Url[i].Length > 4096 || Url[i].ToLower().Contains(Host)) continue;
                long Begin = Data.Position;
                byte[] UrlData = Encoding.UTF8.GetBytes(Url[i]);
                Data.Write(UrlData, 0, UrlData.Length);
                byte[] buff = new byte[8];
                long Last;
                if (Index.Length >= 30) // read the ID of the previous record
                {
                    Index.Position = Index.Length - 30;
                    Index.Read(buff, 0, 8);
                    Index.Position += 22; // back to the end of the file
                    Last = BitConverter.ToInt64(buff, 0);
                }
                else
                {
                    Last = 1000000; // starting ID; if it is too small, the short URLs come out too short
                    Index.Position = 0;
                }
                long RandKey = Last + (long)GetRnd(seekRand);
                byte[] BeginData = BitConverter.GetBytes(Begin);
                byte[] LengthData = BitConverter.GetBytes((Int16)(UrlData.Length));
                byte[] RandKeyData = BitConverter.GetBytes(RandKey);

                Index.Write(RandKeyData, 0, 8);
                Index.Write(BeginData, 0, 8);
                Index.Write(LengthData, 0, 2);
                Index.Write(Hits, 0, Hits.Length);
                Index.Write(dt, 0, dt.Length);
                ResultKey[i] = Mixup(RandKey);
            }
        }
        return ResultKey;
    }

    /// <summary>
    /// Resolves a batch of keys in order and returns the long URLs.
    /// </summary>
    /// <param name="Key">The keys taken from the short URLs.</param>
    /// <returns>One long URL per key, or null where the key is unknown.</returns>
    public static string[] ParseUrl(string[] Key)
    {
        long[] Ids = Key.Select(n => UnMixup(n)).ToArray();
        string[] Result = new string[Ids.Length];
        byte[] buff = new byte[8];
        using (FileStream Index = new FileStream(IndexFile, FileMode.OpenOrCreate, FileAccess.ReadWrite))
        using (FileStream Data = new FileStream(DataFile, FileMode.Open, FileAccess.Read))
        {
            long _Right = Index.Length / 30 - 1;
            for (int j = 0; j < Ids.Length; j++)
            {
                long Id = Ids[j];
                long Left = 0;
                long Right = _Right;
                long Middle = -1;
                bool Found = false; // Middle alone cannot signal a miss
                while (Left <= Right)
                {
                    Middle = Left + (Right - Left) / 2;
                    Index.Position = Middle * 30;
                    Index.Read(buff, 0, 8);
                    long val = BitConverter.ToInt64(buff, 0);
                    if (val == Id)
                    {
                        Found = true;
                        break;
                    }
                    if (val < Id)
                    {
                        Left = Middle + 1;
                    }
                    else
                    {
                        Right = Middle - 1;
                    }
                }
                string Url = null;
                if (Found) // an unknown ID leaves the result null
                {
                    Index.Position = Middle * 30 + 8; // skip the ID
                    Index.Read(buff, 0, buff.Length);
                    long Begin = BitConverter.ToInt64(buff, 0);
                    Index.Read(buff, 0, buff.Length); // Length(2) + Hits(4) + first 2 bytes of DateTime
                    Int16 Length = BitConverter.ToInt16(buff, 0);
                    byte[] UrlTxt = new byte[Length];
                    Data.Position = Begin;
                    Data.Read(UrlTxt, 0, UrlTxt.Length);
                    int Hits = BitConverter.ToInt32(buff, 2); // skip the 2-byte Length
                    byte[] NewHits = BitConverter.GetBytes(Hits + 1); // increment the hit count, 4 bytes
                    Index.Position -= 6; // move the pointer back to just after Length
                    Index.Write(NewHits, 0, NewHits.Length); // overwrite the old Hits
                    Url = Encoding.UTF8.GetString(UrlTxt);
                }
                Result[j] = Url;
            }
        }
        return Result;
    }

    /// <summary>
    /// Obfuscates an ID into a string.
    /// </summary>
    /// <param name="id">A non-negative ID.</param>
    /// <returns>The base-62 key, rotated left.</returns>
    private static string Mixup(long id)
    {
        string Key = Convert(id);
        int s = 0;
        foreach (char c in Key)
        {
            s += (int)c;
        }
        int Len = Key.Length;
        int x = (s % Len); // rotation amount; the character sum does not change under rotation
        char[] arr = Key.ToCharArray();
        char[] newarr = new char[arr.Length];
        Array.Copy(arr, x, newarr, 0, Len - x); // rotate left by x
        Array.Copy(arr, 0, newarr, Len - x, x);
        return new string(newarr);
    }

    /// <summary>
    /// Reverses the obfuscation and returns the ID.
    /// </summary>
    /// <param name="Key">The key from the short URL.</param>
    /// <returns>The ID, or -1 if the key is empty or contains invalid characters.</returns>
    private static long UnMixup(string Key)
    {
        if (string.IsNullOrEmpty(Key)) return -1; // avoids a division by zero below
        int s = 0;
        foreach (char c in Key)
        {
            s += (int)c;
        }
        int Len = Key.Length;
        int x = (s % Len);
        x = Len - x; // rotating left by Len - x undoes a left rotation by x
        char[] arr = Key.ToCharArray();
        char[] newarr = new char[arr.Length];
        Array.Copy(arr, x, newarr, 0, Len - x);
        Array.Copy(arr, 0, newarr, Len - x, x);
        return Convert(new string(newarr));
    }

    /// <summary>
    /// Converts a decimal ID to base 62.
    /// </summary>
    /// <param name="id">A non-negative ID.</param>
    /// <returns>The base-62 string, using Seq as the digit alphabet.</returns>
    private static string Convert(long id)
    {
        if (id < 62)
        {
            return Seq[(int)id].ToString();
        }
        int y = (int)(id % 62);
        long x = id / 62;

        return Convert(x) + Seq[y];
    }

    /// <summary>
    /// Converts a base-62 string back to decimal.
    /// </summary>
    /// <param name="Num">A string made of characters from Seq.</param>
    /// <returns>The decimal value, or -1 if Num contains a character outside Seq.</returns>
    private static long Convert(string Num)
    {
        long v = 0;
        foreach (char c in Num)
        {
            int t = Seq.IndexOf(c);
            if (t < 0) return -1; // not a valid key character
            // Integer arithmetic; Math.Pow works in double and loses precision for large values.
            v = v * 62 + t;
        }
        return v;
    }

    /// <summary>
    /// Generates a random ordering of the characters 0-9, A-Z, a-z.
    /// </summary>
    /// <returns>A 62-character string in which each character appears once.</returns>
    public static string GenerateKeys()
    {
        string[] Chars = "0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z".Split(',');
        int SeekSeek = unchecked((int)DateTime.Now.Ticks);
        Random SeekRand = new Random(SeekSeek);
        for (int i = 0; i < 100000; i++)
        {
            // Next(0, Length) covers every index, including the last one;
            // Next(1, Length) with r - 1 would never move the final character.
            int r = SeekRand.Next(0, Chars.Length);
            string f = Chars[0];
            Chars[0] = Chars[r];
            Chars[r] = f;
        }
        return string.Join("", Chars);
    }

    /// <summary>
    /// Returns a random increment step from 1 to 10.
    /// </summary>
    /// <param name="SeekRand">The random number generator to draw from.</param>
    /// <returns>The step to add to the previous ID.</returns>
    private static Int16 GetRnd(Random SeekRand)
    {
        Int16 Step = (Int16)SeekRand.Next(1, 11);
        return Step;
    }
}


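Advantages of this approach:

Decimal IDs are converted to base-62 strings. Six base-62 characters give 62^6, about 56.8 billion values. With a random step of 1 to 10 (5.5 on average), six characters still hold roughly 10.3 billion records, which is more than I would expect an ordinary database table to handle comfortably. Because each submission is written with an append, write performance should hold up well. Resolving a short URL uses binary search, which only moves the file pointer and reads 8 bytes per probe, so lookups remain fast.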
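The capacity figures are simple arithmetic and can be reproduced with a short sketch. CapacitySketch, KeySpace and ExpectedRecords are hypothetical names; the calculation assumes the step is uniformly distributed between its minimum and maximum.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class CapacitySketch
{
    // Number of distinct values that fit in the given number of base-62 characters.
    public static long KeySpace(int digits)
    {
        long n = 1;
        for (int i = 0; i < digits; i++)
        {
            n = checked(n * 62); // throws instead of silently overflowing
        }
        return n;
    }

    // Approximate record count when IDs advance by a uniform random step.
    public static long ExpectedRecords(int digits, int minStep, int maxStep)
    {
        double meanStep = (minStep + maxStep) / 2.0; // 5.5 for steps of 1..10
        return (long)(KeySpace(digits) / meanStep);
    }
}
```

KeySpace(6) is 56800235584, and ExpectedRecords(6, 1, 10) comes to a little over 10.3 billion.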

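Drawback: under high concurrency, I/O exceptions such as failures to open the files may occur. Reimplementing the service in single-threaded Node.js might avoid this.

A quick plug: a real deployment of this short-URL service is at http://urlj.cn, with an open API and QR code generation, so you can try it from WeChat.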
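One possible mitigation that stays in C# is to serialize file access with a lock, so that only one request at a time opens and appends to a file. This is an assumption of mine rather than something the original service does, and SerializedAppendSketch and Append are hypothetical names. A lock like this only coordinates threads inside a single process; several worker processes would still need some other form of coordination.

```csharp
using System;
using System.IO;

// Hypothetical helper, for illustration only.
public static class SerializedAppendSketch
{
    static readonly object FileLock = new object();

    // Appends data to the file at path and returns the offset where it was written.
    public static long Append(string path, byte[] data)
    {
        lock (FileLock) // only one thread at a time opens and writes the file
        {
            using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read))
            {
                long begin = fs.Position; // FileMode.Append starts at the end of the file
                fs.Write(data, 0, data.Length);
                return begin;
            }
        }
    }
}
```

The returned offset plays the role of Begin in the index record, so reading the position and writing the data happen under the same lock.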