Handling Chinese text encoding in Python

decode early, unicode everywhere, encode late

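1. When a string is read in or declared, call decode as early as possible to turn it into unicode (the unicode type in Python 2).

2. Inside the program, work on unicode only: concatenation, replacement, taking the length of a string, and so on.

3. Finally, when the string is written out (console, web page, file), call encode to turn it into the encoding you need, such as utf-8.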
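Step 2 matters because byte strings and unicode strings give different answers to the same question. The following is a minimal sketch, not part of the original example; it uses `unicode_literals` so that it behaves the same on Python 2 and Python 3, and the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

text = "我是中文"              # unicode text: 4 characters
raw = text.encode("utf-8")     # the same text as bytes, as it might arrive from a file

print(len(raw))    # 12, because len() counts bytes here
print(len(text))   # 4, because len() counts characters here

# Slicing bytes can cut a character in half; slicing unicode cannot.
first_two = text[:2]           # the first two characters
try:
    raw[:4].decode("utf-8")    # 3 bytes of the first character + 1 stray byte
    cut_ok = True
except UnicodeDecodeError:
    cut_ok = False

print(cut_ok)      # False
```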

input (str, utf-8) --decode--> work on unicode --encode--> output (str, utf-8)
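The pipeline above can be sketched with files as the input and output boundary. This is an illustration under assumptions: `transcode_file`, the temporary file names, and the choice of `strip()` as the "work" step are all hypothetical. `io.open` is used because it decodes on read and encodes on write in both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import io
import os
import tempfile


def transcode_file(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Hypothetical helper: read in one encoding, write in another."""
    # Decode early: io.open hands back unicode text, not bytes.
    with io.open(src_path, "r", encoding=src_encoding) as f:
        text = f.read()
    # Unicode everywhere: any string work happens on decoded text.
    text = text.strip()
    # Encode late: bytes are produced only at the moment of writing.
    with io.open(dst_path, "w", encoding=dst_encoding) as f:
        f.write(text)
    return text


# Usage: create a gb2312 file, convert it, and read the result back as bytes.
tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, "in_gb2312.txt")
dst = os.path.join(tmp_dir, "out_utf8.txt")
with io.open(src, "wb") as f:
    f.write("我是中文".encode("gb2312"))

result = transcode_file(src, dst, "gb2312")
with io.open(dst, "rb") as f:
    out_bytes = f.read()
```

After the call, `result` is the decoded text and `out_bytes` holds the utf-8 bytes that were written.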

# -*- coding: gb2312 -*-

__author__ = 'Administrator'

import chardet  # third-party package; this example targets Python 2

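# Without the u prefix, a Python 2.x literal is a byte string of type str; its bytes are in the source file's declared encoding (gb2312 here), not ASCII

ss1 = "我是中文"

print type(ss1)  # <type 'str'>

print chardet.detect(ss1)  # {'confidence': 0.99, 'encoding': 'GB2312'}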

# With the u prefix, the literal is a Unicode string of type unicode

ss2 = u'中文'

print type(ss2)  # <type 'unicode'>

# Call decode to convert str to unicode

ss3 = ss1.decode('gb2312')

print type(ss3)  # <type 'unicode'>

# Call encode to convert unicode back to str

ss4 = ss3.encode('utf8')

print type(ss4)  # <type 'str'>

print chardet.detect(ss4)  # {'confidence': 0.938125, 'encoding': 'utf-8'}
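chardet only guesses, as the confidence values above show. When the possible encodings are already known, a different technique is to try each candidate in turn and keep the first one that decodes cleanly. The sketch below is an assumption layered on the example, not something chardet provides: `decode_with_fallback` and its candidate list are hypothetical, and the order matters because a permissive encoding listed first could accept bytes it should not.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def decode_with_fallback(raw, candidates=("utf-8", "gb2312")):
    """Hypothetical helper: return (text, encoding) for the first candidate that fits."""
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding, try the next one
    raise ValueError("none of the candidate encodings matched")


sample = "我是中文"
from_gb, enc_gb = decode_with_fallback(sample.encode("gb2312"))
from_utf8, enc_utf8 = decode_with_fallback(sample.encode("utf-8"))

print(enc_gb)    # gb2312
print(enc_utf8)  # utf-8
```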