A Quick Look at Unicode: The Pit I Fell Into Calling the iFlytek (科大讯飞) Speech Synthesis API for Japanese

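As the title says, a recent project required me to call iFlytek's speech synthesis API to turn Japanese text into Japanese speech. The annoying part was that the party I was integrating with simply tossed over an API document that said nothing about the parameters Japanese synthesis needs. Chinese and English synthesis worked fine, but the Japanese audio never sounded right. Later they told me the text had to be "unicode encoded", though they had no idea how exactly. That at least gave me a direction, so I ran the text through every Unicode encoding I could think of, and after a lot of trial and error I finally worked out which one was needed. Falling into this pit at least taught me the basics of Unicode.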

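Unicode is both a character set and an encoding scheme. (The name "Universal Multiple-Octet Coded Character Set" that often gets attached to it is really the title of ISO/IEC 10646, the international standard that is kept in sync with Unicode.) We all know what a character is: "中" is a character, for example. So what is a character encoding? It is what lets a computer work with characters. A computer only knows 0 and 1, so before it can represent "中" the character has to be turned into 0s and 1s, and the character encoding defines exactly how. And you cannot talk about character encodings without talking about character sets, because an encoding is always defined on top of one.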
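To make "turning a character into 0s and 1s" concrete, here is a minimal sketch that prints the bits a character becomes under one particular encoding. UTF-8 is chosen only as an example, and the class name `BitsDemo` and the helper `toBits` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class BitsDemo {
    /** Returns the UTF-8 bytes of s, each written as 8 binary digits. */
    static String toBits(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255, add 0x100 to force nine digits, then drop the leading 1.
            sb.append(Integer.toBinaryString((b & 0xff) + 0x100).substring(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits("A"));      // 01000001
        System.out.println(toBits("\u4e2d")); // U+4E2D (中): 11100100 10111000 10101101
    }
}
```

A different encoding would produce different bits for the same character, which is exactly why both sides of an interface have to agree on one.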

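A character set is just a collection of characters. ASCII, defined in the United States, is a 7-bit set of 128 characters covering English letters, digits, punctuation, and control codes; its single-byte 8-bit extensions top out at 256. China defined its own Chinese character set, GB2312, and later the upgraded GBK and GB18030. Here is the problem. A single byte can represent at most 256 characters. GB2312 and GBK use two bytes per Chinese character, which is more capable but still tops out at 65,536. GB18030 goes up to four bytes, but what about everyone else? Other countries were never going to adopt our GB family, and if each of them crowned its own standard the result would be chaos. To settle the character set question once and for all, the international standards bodies (ISO and the Unicode Consortium) stepped in and produced one all-encompassing character set, originally designed around four-byte values. English or Chinese, Japanese or Korean, it aims to swallow the characters of every language in the world.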
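The limits of a small character set are easy to see from Java. The sketch below encodes text as US-ASCII using only standard JDK calls; `CharsetLimitDemo`, `asAscii`, and `fitsInAscii` are illustrative names, not part of any API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetLimitDemo {
    /** Encodes s as US-ASCII; anything ASCII cannot represent is replaced by '?' (byte 63). */
    static byte[] asAscii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** Returns true only if every character of s exists in US-ASCII. */
    static boolean fitsInAscii(String s) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String mixed = "A\u4e2d"; // "A" followed by U+4E2D (中)
        System.out.println(Arrays.toString(asAscii(mixed))); // [65, 63]
        System.out.println(fitsInAscii("Hello"));            // true
        System.out.println(fitsInAscii(mixed));              // false
    }
}
```

The silent replacement by '?' is worth remembering: encoding with the wrong character set usually does not fail, it just produces the wrong bytes.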

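For all its swagger, there is one thing to keep in mind about Unicode: the character set itself only assigns each character a number. How that number is stored as bytes is defined separately, by the encoding forms UTF-8, UTF-16, and UTF-32. UTF stands for Unicode Transformation Format. UTF-8 encodes in units of 8 bits (one byte), UTF-16 in units of 16 bits (two bytes), and UTF-32 in units of 32 bits (four bytes). A Unicode "character" is identified by its code point, usually written in hexadecimal with the prefix "U+"; the code point of "中", for example, is U+4E2D. UTF-32 spends four bytes on every character, wastes a lot of space, and is not used much. UTF-16 uses two bytes for most common characters and four bytes (a surrogate pair) for the rest; it is what Java uses internally for String, which is probably why people say Unicode "defaults to" UTF-16. The most widely used form is UTF-8, a variable-length encoding that stores each character in 1 to 4 bytes. It is the most compact choice for ASCII-heavy text, although Chinese and Japanese characters take three bytes in UTF-8 versus two in UTF-16.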
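The relationship between a code point and its encoded size can be checked directly. This is a small sketch using only standard JDK calls; the class and method names are made up for illustration, and UTF-16BE is used so that no byte order mark is counted.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    /** Formats the first code point of s in U+XXXX notation. */
    static String codePoint(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    /** Number of bytes s occupies in UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes s occupies in UTF-16 (big-endian, no BOM). */
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // "A", U+4E2D (中), and U+1F600 (an emoji outside the 16-bit range)
        String[] samples = {"A", "\u4e2d", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(codePoint(s) + " utf-8=" + utf8Length(s) + " utf-16=" + utf16Length(s));
        }
        // U+0041 utf-8=1 utf-16=2
        // U+4E2D utf-8=3 utf-16=2
        // U+1F600 utf-8=4 utf-16=4
    }
}
```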

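Because UTF-8 works in single-byte units, it has no byte-order problem; with UTF-16 and UTF-32 you do have to care about big-endian versus little-endian. An example explains this better than a pile of words. Say I encode the character "中" in UTF-16. That gives two bytes, so which one comes first? In big-endian order the bytes are 4E 2D; in little-endian order they are 2D 4E. The code point is U+4E2D either way; only the order of the bytes changes. Unicode uses a BOM (Byte Order Mark) to let the computer tell the two apart: the code point U+FEFF is placed at the very start of the data, so the stream begins with the bytes FE FF if it is big-endian and FF FE if it is little-endian. (U+FFFE is deliberately not a valid character, so the two cannot be confused.) Note that the BOM is an extra marker and does not represent an actual character of the text. Putting it all together, there are three flavors: UTF-16BE (BE for Big Endian), UTF-16LE (LE for Little Endian), and plain UTF-16 (byte order identified by the BOM).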
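Here is a short sketch of the three UTF-16 flavors in Java. It relies on the documented behavior of Java's `UTF-16` charset: it writes a big-endian BOM when encoding, and when decoding it honors a BOM if present and otherwise assumes big-endian. The names `BomDemo`, `hex`, and `decodeUtf16` are illustrative.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    /** Formats bytes as space-separated two-digit hex, e.g. "FE FF 4E 2D". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    /** Decodes with the BOM-aware UTF-16 charset; the BOM itself is not part of the result. */
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // U+4E2D (中)
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 4E 2D

        // A little-endian stream announces itself with the bytes FF FE.
        byte[] littleEndianWithBom = {(byte) 0xFF, (byte) 0xFE, 0x2D, 0x4E};
        System.out.println(decodeUtf16(littleEndianWithBom).equals(s)); // true
    }
}
```

When a receiver expects raw UTF-16 without a marker, the two extra BOM bytes produced by plain `UTF-16` are a likely source of trouble, so it can be safer to name `UTF-16BE` or `UTF-16LE` explicitly.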

Enough talk. Here is the code:

import java.io.UnsupportedEncodingException;
import java.util.Arrays;


public class UnicodeTest {

    /**
     * Formats each byte as a two-digit hex string such as "0x4e".
     *
     * @param bytes the bytes to format
     * @return one hex string per input byte
     */
    public static String[] toHexArr(byte[] bytes) {
        String[] hexArr = new String[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            // A Java byte is signed. Without the 0xff mask, Integer.toHexString
            // sign-extends 0xfe and prints it as "fffffffe".
            hexArr[i] = String.format("0x%02x", bytes[i] & 0xff);
        }
        return hexArr;
    }

    /**
     * Reads the bytes two at a time as big-endian 16-bit units and writes each
     * unit in backslash-u notation. The result only matches the real code points
     * when the input is big-endian UTF-16; for UTF-16LE or UTF-8 input it is just
     * the raw bytes paired up. A trailing odd byte is dropped.
     *
     * @param bytes the encoded bytes
     * @return the byte pairs in backslash-u notation
     */
    public static String byteToUnicode(byte[] bytes) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < bytes.length - 1; i += 2) {
            // Pad both bytes to two hex digits, high byte first
            out.append(String.format("\\u%02x%02x", bytes[i] & 0xff, bytes[i + 1] & 0xff));
        }
        return out.toString();
    }

    public static void printCode(String s) {
        try {
            System.out.println("Encoded string:");
            // "UNICODE" is accepted by the JDK as an alias for UTF-16
            byte[] bytes1 = s.getBytes("UNICODE");
            byte[] bytes2 = s.getBytes("UTF-16");
            byte[] bytes3 = s.getBytes("UTF-16BE");
            byte[] bytes4 = s.getBytes("UTF-16LE");
            byte[] bytes5 = s.getBytes("utf-8");

            System.out.println("Hex:");
            System.out.println("unicode : " + Arrays.toString(toHexArr(bytes1)));
            System.out.println("utf-16  : " + Arrays.toString(toHexArr(bytes2)));
            System.out.println("utf-16be: " + Arrays.toString(toHexArr(bytes3)));
            System.out.println("utf-16le: " + Arrays.toString(toHexArr(bytes4)));
            System.out.println("utf-8   : " + Arrays.toString(toHexArr(bytes5)));

            System.out.println("Byte pairs in \\u notation:");
            System.out.println("unicode : " + byteToUnicode(bytes1));
            System.out.println("utf-16  : " + byteToUnicode(bytes2));
            System.out.println("utf-16be: " + byteToUnicode(bytes3));
            System.out.println("utf-16le: " + byteToUnicode(bytes4));
            System.out.println("utf-8   : " + byteToUnicode(bytes5));

            System.out.println("Byte array:");
            System.out.println("unicode : " + Arrays.toString(bytes1));
            System.out.println("utf-16  : " + Arrays.toString(bytes2));
            System.out.println("utf-16be: " + Arrays.toString(bytes3));
            System.out.println("utf-16le: " + Arrays.toString(bytes4));
            System.out.println("utf-8   : " + Arrays.toString(bytes5));

            System.out.println("Decoded string:");
            System.out.println("unicode : " + new String(bytes1, "unicode"));
            System.out.println("utf-16  : " + new String(bytes2, "utf-16"));
            System.out.println("utf-16be: " + new String(bytes3, "utf-16be"));
            System.out.println("utf-16le: " + new String(bytes4, "utf-16le"));
            System.out.println("utf-8   : " + new String(bytes5, "utf-8"));

        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        String s = "中华人民共和国";
        String s1 = "おはよう";
        System.out.println("String content: " + s);
        printCode(s);
        System.out.println("String content: " + s1);
        printCode(s1);
    }
}

Output:

String content: 中华人民共和国
Encoded string:
Hex:
unicode : [0xfe, 0xff, 0x4e, 0x2d, 0x53, 0x4e, 0x4e, 0xba, 0x6c, 0x11, 0x51, 0x71, 0x54, 0x8c, 0x56, 0xfd]
utf-16  : [0xfe, 0xff, 0x4e, 0x2d, 0x53, 0x4e, 0x4e, 0xba, 0x6c, 0x11, 0x51, 0x71, 0x54, 0x8c, 0x56, 0xfd]
utf-16be: [0x4e, 0x2d, 0x53, 0x4e, 0x4e, 0xba, 0x6c, 0x11, 0x51, 0x71, 0x54, 0x8c, 0x56, 0xfd]
utf-16le: [0x2d, 0x4e, 0x4e, 0x53, 0xba, 0x4e, 0x11, 0x6c, 0x71, 0x51, 0x8c, 0x54, 0xfd, 0x56]
utf-8   : [0xe4, 0xb8, 0xad, 0xe5, 0x8d, 0x8e, 0xe4, 0xba, 0xba, 0xe6, 0xb0, 0x91, 0xe5, 0x85, 0xb1, 0xe5, 0x92, 0x8c, 0xe5, 0x9b, 0xbd]
Byte pairs in \u notation:
unicode : \ufeff\u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd
utf-16  : \ufeff\u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd
utf-16be: \u4e2d\u534e\u4eba\u6c11\u5171\u548c\u56fd
utf-16le: \u2d4e\u4e53\uba4e\u116c\u7151\u8c54\ufd56
utf-8   : \ue4b8\uade5\u8d8e\ue4ba\ubae6\ub091\ue585\ub1e5\u928c\ue59b
Byte array:
unicode : [-2, -1, 78, 45, 83, 78, 78, -70, 108, 17, 81, 113, 84, -116, 86, -3]
utf-16  : [-2, -1, 78, 45, 83, 78, 78, -70, 108, 17, 81, 113, 84, -116, 86, -3]
utf-16be: [78, 45, 83, 78, 78, -70, 108, 17, 81, 113, 84, -116, 86, -3]
utf-16le: [45, 78, 78, 83, -70, 78, 17, 108, 113, 81, -116, 84, -3, 86]
utf-8   : [-28, -72, -83, -27, -115, -114, -28, -70, -70, -26, -80, -111, -27, -123, -79, -27, -110, -116, -27, -101, -67]
Decoded string:
unicode : 中华人民共和国
utf-16  : 中华人民共和国
utf-16be: 中华人民共和国
utf-16le: 中华人民共和国
utf-8   : 中华人民共和国
String content: おはよう
Encoded string:
Hex:
unicode : [0xfe, 0xff, 0x30, 0x4a, 0x30, 0x6f, 0x30, 0x88, 0x30, 0x46]
utf-16  : [0xfe, 0xff, 0x30, 0x4a, 0x30, 0x6f, 0x30, 0x88, 0x30, 0x46]
utf-16be: [0x30, 0x4a, 0x30, 0x6f, 0x30, 0x88, 0x30, 0x46]
utf-16le: [0x4a, 0x30, 0x6f, 0x30, 0x88, 0x30, 0x46, 0x30]
utf-8   : [0xe3, 0x81, 0x8a, 0xe3, 0x81, 0xaf, 0xe3, 0x82, 0x88, 0xe3, 0x81, 0x86]
Byte pairs in \u notation:
unicode : \ufeff\u304a\u306f\u3088\u3046
utf-16  : \ufeff\u304a\u306f\u3088\u3046
utf-16be: \u304a\u306f\u3088\u3046
utf-16le: \u4a30\u6f30\u8830\u4630
utf-8   : \ue381\u8ae3\u81af\ue382\u88e3\u8186
Byte array:
unicode : [-2, -1, 48, 74, 48, 111, 48, -120, 48, 70]
utf-16  : [-2, -1, 48, 74, 48, 111, 48, -120, 48, 70]
utf-16be: [48, 74, 48, 111, 48, -120, 48, 70]
utf-16le: [74, 48, 111, 48, -120, 48, 70, 48]
utf-8   : [-29, -127, -118, -29, -127, -81, -29, -126, -120, -29, -127, -122]
Decoded string:
unicode : おはよう
utf-16  : おはよう
utf-16be: おはよう
utf-16le: おはよう
utf-8   : おはよう