Detecting character encoding in Python

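In general, you need to add this at the top of the script (Python 2 only; sys.setdefaultencoding does not exist in Python 3):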

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

To open files in other encodings, use codecs.open.

Reading

The code below reads a file and collects the stripped content of each line into a list.

import codecs
# Open with an explicit encoding; the with block closes the file automatically.
with codecs.open('test.txt', 'r', 'utf-8') as f:
    lines = [line.strip() for line in f]
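The same pattern applies to encodings other than UTF-8. Here is a minimal sketch, assuming a hypothetical sample file named gbk_sample.txt and a hypothetical helper read_lines. It uses io.open in place of codecs.open, since io.open takes the same encoding argument and behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_lines(path, encoding):
    """Return the stripped lines of a text file decoded with `encoding`."""
    with io.open(path, 'r', encoding=encoding) as f:
        return [line.strip() for line in f]


# Hypothetical sample file, created in a temp directory for illustration only.
sample_path = os.path.join(tempfile.mkdtemp(), 'gbk_sample.txt')
with io.open(sample_path, 'w', encoding='gbk') as f:
    f.write(u'第一行\n第二行\n')

lines = read_lines(sample_path, 'gbk')
print(len(lines))
```

Reading the same file with the wrong encoding (for example 'utf-8') would raise UnicodeDecodeError, which is why the encoding has to be known or guessed first.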

When we don't know a file's encoding, how can the program work it out?

Use the chardet module; its result can then be combined with codecs.

import chardet
import urllib

# Pick whatever data source you need
TestData = urllib.urlopen('http://www.baidu.com/').read()
print chardet.detect(TestData)

Output:
{'confidence': 0.99, 'encoding': 'GB2312'}
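Combining the two could look roughly like the sketch below. The function name read_unknown_encoding and the fallback parameter are assumptions made for illustration, not part of chardet or codecs. Because the bytes are already in memory after detection, the sketch decodes them with codecs.decode rather than reopening the file with codecs.open.

```python
import codecs


def read_unknown_encoding(path, detect=None, fallback='utf-8'):
    """Guess the encoding of `path`, then return (encoding, stripped lines).

    `detect` is any callable that takes bytes and returns a dict with an
    'encoding' key; it defaults to chardet.detect (third-party package).
    """
    if detect is None:
        import chardet  # imported lazily so a custom detector needs no chardet
        detect = chardet.detect

    with open(path, 'rb') as f:
        raw = f.read()

    # chardet reports None as the encoding when it cannot decide,
    # so fall back to an assumed default in that case.
    encoding = detect(raw).get('encoding') or fallback
    text = codecs.decode(raw, encoding)
    return encoding, [line.strip() for line in text.splitlines()]
```

The guess is only a statistical estimate, so it is worth checking the 'confidence' value before trusting it on short inputs.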

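Reference: http://www.jb51.net/article/65790.htm also covers how to detect the encoding of a web page.

http://blog.csdn.net/aqwd2008/article/details/7506007# notes that for a large file it is enough to read only the first few lines.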
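For large files, chardet's UniversalDetector can be fed one line at a time and stopped as soon as it is confident. The sketch below is one way to do that; the function name detect_file_encoding and the max_lines cap are assumptions added for illustration.

```python
def detect_file_encoding(path, detector=None, max_lines=200):
    """Feed a file to an incremental detector line by line, stopping early.

    `detector` is any object with chardet's UniversalDetector interface
    (feed(), done, close(), result); one is created if not supplied.
    `max_lines` is an assumed safety cap, not a chardet setting.
    """
    if detector is None:
        from chardet.universaldetector import UniversalDetector
        detector = UniversalDetector()

    with open(path, 'rb') as f:
        for line_number, line in enumerate(f, 1):
            detector.feed(line)
            # `done` turns True once the detector is confident enough.
            if detector.done or line_number >= max_lines:
                break
    detector.close()  # finalizes the guess before `result` is read
    return detector.result
```

The returned dict has the same shape as the output of chardet.detect, so its 'encoding' value can be passed straight to codecs.open.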

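Converting strings in this escaped format into normal text

# A unicode literal: the \uXXXX escapes are interpreted directly.
a = u"\u5973\u7ae5\u8f8d\u5b66\u7167\u987e\u75c5\u7236"
print a
# A byte string: the backslashes are literal, so decode the escapes explicitly.
a = '\u559c\u6b22\u4e00\u4e2a\u4eba'
print a.decode('raw_unicode_escape')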

Output:
女童辍学照顾病父
喜欢一个人
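In Python 3, str has no decode method, so the same conversion needs a round trip through bytes. A minimal sketch, assuming the escaped text is pure ASCII; unescape is a hypothetical helper name.

```python
def unescape(text):
    """Turn literal \\uXXXX sequences in an ASCII string into real characters."""
    # encode('ascii') gives the raw bytes; unicode_escape then interprets
    # each backslash-u sequence as the character it names.
    return text.encode('ascii').decode('unicode_escape')


# Raw string literal, so the backslashes stay in the data as-is.
escaped = r'\u559c\u6b22\u4e00\u4e2a\u4eba'
print(unescape(escaped))
```

The encode('ascii') step raises UnicodeEncodeError if the input already contains non-ASCII characters, which makes that assumption explicit instead of silently producing garbage.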