UTF-8 decode

Excerpted from the Lua 5.3 source file lutf8lib.c (MAXUNICODE is defined earlier in that file as 0x10FFFF).

/*
** Decode one UTF-8 sequence, returning NULL if byte sequence is invalid.
*/
static const char *utf8_decode (const char *o, int *val) {
  static const unsigned int limits[] = {0xFF, 0x7F, 0x7FF, 0xFFFF};
  const unsigned char *s = (const unsigned char *)o;
  unsigned int c = s[0];
  unsigned int res = 0;  /* final result */
  if (c < 0x80)  /* ascii? */
    res = c;
  else {
    int count = 0;  /* to count number of continuation bytes */
    while (c & 0x40) {  /* still have continuation bytes? */
      int cc = s[++count];  /* read next byte */
      if ((cc & 0xC0) != 0x80)  /* not a continuation byte? */
        return NULL;  /* invalid byte sequence */
      res = (res << 6) | (cc & 0x3F);  /* add lower 6 bits from cont. byte */
      c <<= 1;  /* to test next bit */
    }
    res |= ((c & 0x7F) << (count * 5));  /* add first byte */
    if (count > 3 || res > MAXUNICODE || res <= limits[count])
      return NULL;  /* invalid byte sequence */
    s += count;  /* skip continuation bytes read */
  }
  if (val) *val = res;
  return (const char *)s + 1;  /* +1 to include first byte */
}

For background on UTF-8, see http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

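The UTF-8 encoding rules are simple. There are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. For English letters, therefore, the UTF-8 encoding is identical to the ASCII code.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are all set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are always set to 10. All remaining bits hold the symbol's Unicode code point, most significant bits first, padded with leading zeros.

The table below summarizes the encoding rules. The letter x marks the bits available for the code point.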

Unicode code point range | UTF-8 encoding
(hexadecimal)            | (binary)
-------------------------+-------------------------------------
0000 0000-0000 007F      | 0xxxxxxx
0000 0080-0000 07FF      | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF      | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF      | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
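
The table can be read directly as an encoder: pick the row by range, then distribute the code point's bits over the x positions. The following is a minimal sketch under that reading. The function name utf8_encode_sketch is hypothetical and is not part of lutf8lib.c, and the sketch does not reject surrogate code points (U+D800 to U+DFFF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from Lua: encode code point cp into buf
   following the four rows of the table. Returns the number of bytes
   written (1-4), or 0 if cp is above 0x10FFFF. */
static int utf8_encode_sketch(unsigned int cp, unsigned char buf[4]) {
  assert(buf != NULL);
  if (cp <= 0x7F) {                      /* 0xxxxxxx */
    buf[0] = (unsigned char)cp;
    return 1;
  }
  if (cp <= 0x7FF) {                     /* 110xxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xC0 | (cp >> 6));
    buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp <= 0xFFFF) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xE0 | (cp >> 12));
    buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                              /* outside the Unicode range */
}
```

For example, U+4E25 falls in the third row, so utf8_encode_sketch(0x4E25, buf) returns 3 and fills buf with E4 B8 A5.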

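Given the table above, decoding UTF-8 is straightforward. Look at the first byte of a character: if its first bit is 0, that byte is a complete character by itself; if its first bit is 1, the number of consecutive leading 1 bits is the number of bytes the character occupies. A byte that begins with 10 is a continuation byte and cannot start a character.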
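
Counting the leading 1 bits of the first byte can be sketched as below. The function name utf8_seq_len_sketch is hypothetical and is not taken from Lua. utf8_decode above does the same counting inside its while loop, but it starts testing at bit 6 (c & 0x40) because it already knows bit 7 is set, and it counts continuation bytes rather than total length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from Lua: return the total sequence length
   in bytes (1-4) announced by the first byte at s, or 0 if that byte cannot
   start a sequence. Only the first byte is inspected; the continuation
   bytes are not validated here. */
static int utf8_seq_len_sketch(const char *s) {
  unsigned int c;
  int ones = 0;                 /* number of leading 1 bits */
  assert(s != NULL);
  c = (unsigned char)s[0];
  while (c & 0x80) {            /* top bit still 1? */
    ones++;
    c <<= 1;                    /* move the next bit into the top position */
  }
  if (ones == 0)
    return 1;                   /* 0xxxxxxx: single byte */
  if (ones == 1 || ones > 4)
    return 0;                   /* 10xxxxxx continuation byte, or 11111xxx */
  return ones;                  /* 2, 3, or 4 bytes */
}
```

With the bytes E4 B8 A5, the first byte is 11100100, which has three leading 1 bits, so the function returns 3.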