Notes on Character Encoding Conversion

What is a character encoding?

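A character encoding is the format in which a computer stores text. English letters, for example, are stored in ASCII, one byte per character. Other encodings include UTF-8 (a universal encoding) and regional encodings such as ISO-8859-1 (Western European), windows-1251 (Russian), and the Chinese GB encodings.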

Why is conversion needed?

Because different regions use different encodings, exchanging information requires converting the same characters from one encoding to another.

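The universal encoding is UTF-8, which can represent every Unicode character. For portability, files are therefore usually encoded in UTF-8 (for example, web pages that must display multiple languages), and text in other encodings is generally converted to UTF-8.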
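To make the difference concrete, here is a minimal Lua sketch that encodes a Unicode code point as UTF-8 by hand. The helper name utf8_encode is made up for this note and is not part of any library. The Cyrillic capital letter А (U+0410) is the single byte 0xC0 in windows-1251 but the two bytes 208 144 in UTF-8, so text cannot simply be copied byte for byte between the two encodings.

```lua
-- Encode one Unicode code point (0 .. 0x10FFFF) as a UTF-8 byte string.
-- Hypothetical helper for illustration; works on Lua 5.1 and later.
local function utf8_encode(cp)
  if cp < 0x80 then
    -- 1 byte: 0xxxxxxx (plain ASCII)
    return string.char(cp)
  elseif cp < 0x800 then
    -- 2 bytes: 110xxxxx 10xxxxxx
    return string.char(0xC0 + math.floor(cp / 0x40),
                       0x80 + cp % 0x40)
  elseif cp < 0x10000 then
    -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return string.char(0xE0 + math.floor(cp / 0x1000),
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  else
    -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return string.char(0xF0 + math.floor(cp / 0x40000),
                       0x80 + math.floor(cp / 0x1000) % 0x40,
                       0x80 + math.floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

print(utf8_encode(0x41) == "A")            --> true (ASCII is unchanged)
print(utf8_encode(0x410) == "\208\144")    --> true (Cyrillic А)
```

ASCII characters keep their single-byte form in UTF-8, which is why only bytes 128 and above need real conversion work in a table-driven converter.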

The LIBICONV conversion library

http://www.gnu.org/software/libiconv/#introduction

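The GNU project provides an open-source conversion library that supports conversion between many encodings and Unicode. The library can be used on systems that do not provide encoding conversion themselves.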

Project page: http://savannah.gnu.org/projects/libiconv/

Recent versions of the Linux C library (glibc) already provide iconv conversion, so libiconv need not be installed:

http://davidgao.github.io/LFSCN/chapter06/glibc.html

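Some packages outside LFS recommend installing GNU libiconv for converting text between encodings. The project's home page (http://www.gnu.org/software/libiconv/) says: "This library provides an iconv() implementation, for use on systems which don't have one, or whose implementation cannot convert from/to Unicode." Glibc provides an iconv() implementation and can convert from/to Unicode, so libiconv is not required on an LFS system.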

LUAICONV

For Lua, a mature language, the iconv functionality has been wrapped in a dedicated library that Lua application scripts can use.

From the official site:

http://ittner.github.io/lua-iconv/#download-and-installation

 local iconv = require("iconv")
  cd = iconv.new(to, from)
  cd = iconv.open(to, from)

  nstr, err = cd:iconv(str)

    Converts the 'str' string to the desired charset. This method always
    returns two arguments: the converted string and an error code, which
    may have any of the following values:

    nil
        No error. Conversion was successful.

    iconv.ERROR_NO_MEMORY
        Failed to allocate enough memory in the conversion process.

    iconv.ERROR_INVALID
        An invalid character was found in the input sequence.

    iconv.ERROR_INCOMPLETE
        An incomplete character was found in the input sequence.

    iconv.ERROR_FINALIZED
        Trying to use an already-finalized converter. This usually means
        that the user was tweaking the garbage collector private methods.

    iconv.ERROR_UNKNOWN
        There was an unknown error.

For Lua 5.1, the lua-iconv-5 release is recommended; the latest release, -7, is compatible with Lua 5.2.

https://github.com/ittner/lua-iconv/releases/tag/lua-iconv-5

Running the test after installation produced an error:

:~/share_windows/openSource/lua/lua-iconv-lua-iconv-5$ lua test_iconv.lua
lua: error loading module 'iconv' from file './iconv.so':
    ./iconv.so: undefined symbol: libiconv_open
stack traceback:
    [C]: ?
    [C]: in function 'require'
    test_iconv.lua:1: in main chunk
    [C]: ?

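After some investigation (inspired by this article: http://tonybai.com/2013/04/25/a-libiconv-linkage-problem/),

the cause turned out to be that libiconv had been installed earlier, which copied that library's iconv.h to /usr/local/include/iconv.h.

When the luaiconv project was then built and luaiconv.c was compiled, gcc found /usr/local/include/iconv.h first and built iconv.so against the function declarations in that file.

The build should instead have used the system-provided iconv.h, which is /usr/include/iconv.h.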

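Search order. The command below lists the linker's library search directories, where /usr/local/lib comes before /usr/lib; gcc's default header search likewise looks in /usr/local/include before /usr/include: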

:~/share_windows/openSource/lua/lua-iconv-lua-iconv-5$ ld -verbose | grep SEARCH
SEARCH_DIR("=/usr/i686-linux-gnu/lib32"); SEARCH_DIR("=/usr/local/lib32"); SEARCH_DIR("=/lib32"); SEARCH_DIR("=/usr/lib32"); SEARCH_DIR("=/usr/i686-linux-gnu/lib"); SEARCH_DIR("=/usr/local/lib/i386-linux-gnu"); SEARCH_DIR("=/usr/local/lib"); SEARCH_DIR("=/lib/i386-linux-gnu"); SEARCH_DIR("=/lib"); SEARCH_DIR("=/usr/lib/i386-linux-gnu"); SEARCH_DIR("=/usr/lib");

libiconv's header, /usr/local/include/iconv.h:

#ifndef LIBICONV_PLUG
#define iconv_open libiconv_open
#endif
extern LIBICONV_DLL_EXPORTED iconv_t iconv_open (const char* tocode, const char* fromcode);

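In libiconv's iconv.c, the definition of libiconv_open is controlled by macros and was presumably not enabled; alternatively, luaiconv was built without linking against the libiconv library. Either way, because the header above renames iconv_open to libiconv_open, iconv.so references a symbol that glibc does not provide.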

#if defined __FreeBSD__ && !defined __gnu_freebsd__
/* GNU libiconv is the native FreeBSD iconv implementation since 2002.
   It wants to define the symbols 'iconv_open', 'iconv', 'iconv_close'.  */
#define strong_alias(name, aliasname) _strong_alias(name, aliasname)
#define _strong_alias(name, aliasname) \
  extern __typeof (name) aliasname __attribute__ ((alias (#name)));
#undef iconv_open
#undef iconv
#undef iconv_close
strong_alias (libiconv_open, iconv_open)
strong_alias (libiconv, iconv)
strong_alias (libiconv_close, iconv_close)
#endif

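Fix: in the implementation file, change how iconv.h is included, from the standard angle-bracket form to a quoted form with the full path /usr/include/iconv.h.

Then run make && make install again; the test now runs fine.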

vim luaiconv.c


#include <lua.h>
#include <lauxlib.h>
#include <stdlib.h>

#include "/usr/include/iconv.h"
#include <errno.h>

For other installation or runtime errors, see:

https://github.com/ittner/lua-iconv/issues/3

Experiment: generating a conversion table

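On some embedded systems libiconv is not installed and the libc does not implement iconv either, yet character conversion is still needed.

In that case, luaiconv can be installed on the build server and the system's iconv used to generate a mapping table from one encoding to another; the conversion is then performed with that table.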

For example, converting windows-1251 to UTF-8.

windows-1251 character code reference:

http://www.science.co.il/language/Character-code.asp?s=1251

Lua code that generates the table:

-- Serialize a Lua value (tables recursively) into Lua source text.
function serializeTable(val, name, skipnewlines, depth)
    skipnewlines = skipnewlines or false
    depth = depth or 0
    local tmp = string.rep(" ", depth)
    if type(name) == "number" then
        -- Numeric keys need brackets to be valid Lua source: [1] = ...
        tmp = tmp .. "[" .. name .. "] = "
    elseif name then
        tmp = tmp .. name .. " = "
    end
    if type(val) == "table" then
        tmp = tmp .. "{" .. (not skipnewlines and "\n" or "")
        for k, v in pairs(val) do
            tmp = tmp .. serializeTable(v, k, skipnewlines, depth + 1) .. "," .. (not skipnewlines and "\n" or "")
        end
        tmp = tmp .. string.rep(" ", depth) .. "}"
    elseif type(val) == "number" then
        tmp = tmp .. tostring(val)
    elseif type(val) == "string" then
        tmp = tmp .. string.format("%q", val)
    elseif type(val) == "boolean" then
        tmp = tmp .. (val and "true" or "false")
    else
        -- Functions, userdata and threads cannot be serialized.
        tmp = tmp .. "\"[inserializeable datatype:" .. type(val) .. "]\""
    end
    return tmp
end

local iconv = require("iconv")
-- Set your terminal encoding here
-- local termcs = "iso-8859-1"
local termcs = "utf-8"

function check_one(to, from, text)
  print("\n-- Testing conversion from " .. from .. " to " .. to)
  local cd = iconv.new(to .. "//TRANSLIT", from)
  assert(cd, "Failed to create a converter object.")
  local ostr, err = cd:iconv(text)
  if err == iconv.ERROR_INCOMPLETE then
    print("ERROR: Incomplete input.")
  elseif err == iconv.ERROR_INVALID then
    print("ERROR: Invalid input.")
  elseif err == iconv.ERROR_NO_MEMORY then
    print("ERROR: Failed to allocate memory.")
  elseif err == iconv.ERROR_UNKNOWN then
    print("ERROR: There was an unknown error.")
  end

  print(ostr)
  return ostr
end
 
local result = {}
local num = 255
for i = 0, num do
  print("----------------------------------- i="..i)
  local char = string.char(i)
  local ostr = check_one(termcs, "windows-1251", char)
  print(string.len(ostr))
  local byteStr = ""
  for j = 1, string.len(ostr) do
      local byteVal = string.byte(ostr,j)
      print("byte j=" ..j .. " byteVal=".. byteVal)
      byteStr = byteStr .. "\\" .. byteVal
  end
  print("char i=" ..i .. " byteStr=".. byteStr)
  table.insert(result, byteStr)
end

print("-----------------------------------!!")
s = serializeTable(result)
print(s)

The cleaned-up windows-1251 to UTF-8 table:

-- Entry [b + 1] holds the UTF-8 bytes for windows-1251 byte b (Lua tables are 1-based).
local transTbl_1251toutf8 = {
 [1] = "\0",
 [2] = "\1",
 [3] = "\2",
 [4] = "\3",
 [5] = "\4",
 [6] = "\5",
 [7] = "\6",
 [8] = "\7",
 [9] = "\8",
 [10] = "\9",
 [11] = "\10",
 [12] = "\11",
 [13] = "\12",
 [14] = "\13",
 [15] = "\14",
 [16] = "\15",
 [17] = "\16",
 [18] = "\17",
 [19] = "\18",
 [20] = "\19",
 [21] = "\20",
 [22] = "\21",
 [23] = "\22",
 [24] = "\23",
 [25] = "\24",
 [26] = "\25",
 [27] = "\26",
 [28] = "\27",
 [29] = "\28",
 [30] = "\29",
 [31] = "\30",
 [32] = "\31",
 [33] = "\32",
 [34] = "\33",
 [35] = "\34",
 [36] = "\35",
 [37] = "\36",
 [38] = "\37",
 [39] = "\38",
 [40] = "\39",
 [41] = "\40",
 [42] = "\41",
 [43] = "\42",
 [44] = "\43",
 [45] = "\44",
 [46] = "\45",
 [47] = "\46",
 [48] = "\47",
 [49] = "\48",
 [50] = "\49",
 [51] = "\50",
 [52] = "\51",
 [53] = "\52",
 [54] = "\53",
 [55] = "\54",
 [56] = "\55",
 [57] = "\56",
 [58] = "\57",
 [59] = "\58",
 [60] = "\59",
 [61] = "\60",
 [62] = "\61",
 [63] = "\62",
 [64] = "\63",
 [65] = "\64",
 [66] = "\65",
 [67] = "\66",
 [68] = "\67",
 [69] = "\68",
 [70] = "\69",
 [71] = "\70",
 [72] = "\71",
 [73] = "\72",
 [74] = "\73",
 [75] = "\74",
 [76] = "\75",
 [77] = "\76",
 [78] = "\77",
 [79] = "\78",
 [80] = "\79",
 [81] = "\80",
 [82] = "\81",
 [83] = "\82",
 [84] = "\83",
 [85] = "\84",
 [86] = "\85",
 [87] = "\86",
 [88] = "\87",
 [89] = "\88",
 [90] = "\89",
 [91] = "\90",
 [92] = "\91",
 [93] = "\92",
 [94] = "\93",
 [95] = "\94",
 [96] = "\95",
 [97] = "\96",
 [98] = "\97",
 [99] = "\98",
 [100] = "\99",
 [101] = "\100",
 [102] = "\101",
 [103] = "\102",
 [104] = "\103",
 [105] = "\104",
 [106] = "\105",
 [107] = "\106",
 [108] = "\107",
 [109] = "\108",
 [110] = "\109",
 [111] = "\110",
 [112] = "\111",
 [113] = "\112",
 [114] = "\113",
 [115] = "\114",
 [116] = "\115",
 [117] = "\116",
 [118] = "\117",
 [119] = "\118",
 [120] = "\119",
 [121] = "\120",
 [122] = "\121",
 [123] = "\122",
 [124] = "\123",
 [125] = "\124",
 [126] = "\125",
 [127] = "\126",
 [128] = "\127",
 [129] = "\208\130",
 [130] = "\208\131",
 [131] = "\226\128\154",
 [132] = "\209\147",
 [133] = "\226\128\158",
 [134] = "\226\128\166",
 [135] = "\226\128\160",
 [136] = "\226\128\161",
 [137] = "\226\130\172",
 [138] = "\226\128\176",
 [139] = "\208\137",
 [140] = "\226\128\185",
 [141] = "\208\138",
 [142] = "\208\140",
 [143] = "\208\139",
 [144] = "\208\143",
 [145] = "\209\146",
 [146] = "\226\128\152",
 [147] = "\226\128\153",
 [148] = "\226\128\156",
 [149] = "\226\128\157",
 [150] = "\226\128\162",
 [151] = "\226\128\147",
 [152] = "\226\128\148",
 [153] = "",  -- byte 0x98 is unassigned in windows-1251
 [154] = "\226\132\162",
 [155] = "\209\153",
 [156] = "\226\128\186",
 [157] = "\209\154",
 [158] = "\209\156",
 [159] = "\209\155",
 [160] = "\209\159",
 [161] = "\194\160",
 [162] = "\208\142",
 [163] = "\209\158",
 [164] = "\208\136",
 [165] = "\194\164",
 [166] = "\210\144",
 [167] = "\194\166",
 [168] = "\194\167",
 [169] = "\208\129",
 [170] = "\194\169",
 [171] = "\208\132",
 [172] = "\194\171",
 [173] = "\194\172",
 [174] = "\194\173",
 [175] = "\194\174",
 [176] = "\208\135",
 [177] = "\194\176",
 [178] = "\194\177",
 [179] = "\208\134",
 [180] = "\209\150",
 [181] = "\210\145",
 [182] = "\194\181",
 [183] = "\194\182",
 [184] = "\194\183",
 [185] = "\209\145",
 [186] = "\226\132\150",
 [187] = "\209\148",
 [188] = "\194\187",
 [189] = "\209\152",
 [190] = "\208\133",
 [191] = "\209\149",
 [192] = "\209\151",
 [193] = "\208\144",
 [194] = "\208\145",
 [195] = "\208\146",
 [196] = "\208\147",
 [197] = "\208\148",
 [198] = "\208\149",
 [199] = "\208\150",
 [200] = "\208\151",
 [201] = "\208\152",
 [202] = "\208\153",
 [203] = "\208\154",
 [204] = "\208\155",
 [205] = "\208\156",
 [206] = "\208\157",
 [207] = "\208\158",
 [208] = "\208\159",
 [209] = "\208\160",
 [210] = "\208\161",
 [211] = "\208\162",
 [212] = "\208\163",
 [213] = "\208\164",
 [214] = "\208\165",
 [215] = "\208\166",
 [216] = "\208\167",
 [217] = "\208\168",
 [218] = "\208\169",
 [219] = "\208\170",
 [220] = "\208\171",
 [221] = "\208\172",
 [222] = "\208\173",
 [223] = "\208\174",
 [224] = "\208\175",
 [225] = "\208\176",
 [226] = "\208\177",
 [227] = "\208\178",
 [228] = "\208\179",
 [229] = "\208\180",
 [230] = "\208\181",
 [231] = "\208\182",
 [232] = "\208\183",
 [233] = "\208\184",
 [234] = "\208\185",
 [235] = "\208\186",
 [236] = "\208\187",
 [237] = "\208\188",
 [238] = "\208\189",
 [239] = "\208\190",
 [240] = "\208\191",
 [241] = "\209\128",
 [242] = "\209\129",
 [243] = "\209\130",
 [244] = "\209\131",
 [245] = "\209\132",
 [246] = "\209\133",
 [247] = "\209\134",
 [248] = "\209\135",
 [249] = "\209\136",
 [250] = "\209\137",
 [251] = "\209\138",
 [252] = "\209\139",
 [253] = "\209\140",
 [254] = "\209\141",
 [255] = "\209\142",
 [256] = "\209\143",
}
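
The notes stop at the table itself, so here is a minimal sketch of how such a table might be used on the target system. The function name convert_with_table and the small sample table are made up for illustration; the sample only covers ASCII plus two Cyrillic letters copied from the table, so that the snippet runs on its own without iconv.

```lua
-- Convert a windows-1251 byte string to UTF-8 with a lookup table.
-- tbl[b + 1] must hold the UTF-8 bytes for input byte b (Lua tables are 1-based).
-- Bytes with no table entry are copied through unchanged, because gsub keeps
-- the original match when the callback returns nil.
local function convert_with_table(tbl, s)
  local out = s:gsub(".", function(c)
    return tbl[c:byte() + 1]
  end)
  return out
end

-- Hypothetical excerpt of the table: ASCII maps to itself,
-- plus two Cyrillic letters taken from the full table.
local sample = {}
for b = 0, 127 do
  sample[b + 1] = string.char(b)
end
sample[0xC4 + 1] = "\208\148"  -- Д, entry 197 in the full table
sample[0xE0 + 1] = "\208\176"  -- а, entry 225 in the full table

-- "\196\224!" is "Да!" encoded in windows-1251.
local utf8_text = convert_with_table(sample, "\196\224!")
print(utf8_text == "\208\148\208\176!")  --> true
```

With the full table loaded, transTbl_1251toutf8 would be passed in place of sample. Since the empty entry for byte 0x98 is a string rather than nil, that byte is dropped from the output instead of being copied through.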
Original article: https://www.cnblogs.com/lightsong/p/4634642.html