Reading a Unicode-encoded file in Python

References:

 https://blog.csdn.net/csdn_yi_e/article/details/71037288

https://blog.csdn.net/qq_42739440/article/details/89887451

1. Detecting the encoding with chardet

import chardet

# Read the raw bytes; chardet inspects bytes, not decoded text
with open('a.txt', 'rb') as f:
    raw = f.read()
info = chardet.detect(raw)
print(info)

{'encoding': 'UTF-16', 'confidence': 1.0, 'language': ''}
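A confidence of 1.0 together with 'UTF-16' is consistent with the file starting with a byte order mark (BOM). As a rough cross-check that needs no third-party package, the first bytes can be compared against the BOM constants in the standard `codecs` module. This is only a sketch: the helper name `sniff_bom` and the sample bytes are made up for illustration, and it is not how chardet itself is implemented.

```python
import codecs

# BOMs to check. The UTF-32 LE BOM starts with the same two bytes as the
# UTF-16 LE BOM, so the UTF-32 entries must be tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32"),
    (codecs.BOM_UTF32_BE, "UTF-32"),
    (codecs.BOM_UTF8, "UTF-8-SIG"),
    (codecs.BOM_UTF16_LE, "UTF-16"),
    (codecs.BOM_UTF16_BE, "UTF-16"),
]


def sniff_bom(raw):
    """Return a codec name if `raw` starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return None


# Hypothetical sample: Python's "utf-16" codec prepends a BOM when encoding.
sample = "1.abc".encode("utf-16")
print(sniff_bom(sample))
print(sniff_bom(b"plain ascii"))
```

A file without a BOM returns None here, and in that case a statistical detector such as chardet is still needed.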

2. Reading the file, then encoding and decoding

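# Open in text mode with the encoding that chardet reported
with open('a.txt', encoding='UTF-16') as f:
    text = f.read()
# Encode to bytes, then decode with unicode_escape to resolve \uXXXX escapes
print(text.encode("utf-8").decode("unicode_escape"))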

1.新出吐鲁番文书及其研究

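Encoding first and then decoding yields the Chinese text. This works because the str read from the file still holds literal \uXXXX escape sequences: encode() turns it into bytes, and decoding those bytes with unicode_escape converts each escape into the character it stands for.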
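The round trip can be reproduced without the original a.txt. The sketch below writes a small UTF-16 file into a temporary directory and reads it back. The sample string is an assumption: it contains only two escaped characters and stands in for the real file, whose exact contents are not shown here.

```python
import os
import tempfile

# Hypothetical sample: literal backslash-u escapes (a raw string, so Python
# does not interpret them), saved as UTF-16 like the file in this post.
escaped = r"1.\u65b0\u51fa"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a.txt")
    with open(path, "w", encoding="UTF-16") as f:
        f.write(escaped)
    with open(path, encoding="UTF-16") as f:
        text = f.read()

# text is a str that still contains the escapes verbatim
print(text)

# bytes -> str, with every \uXXXX escape replaced by its character
decoded = text.encode("utf-8").decode("unicode_escape")
print(decoded)
```

One caveat: unicode_escape treats any non-ASCII byte as Latin-1, so this trick is only safe when the text consists of ASCII plus escapes. If the file mixes real Chinese characters with escapes, the real characters come out garbled.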

3. Unicode conversion in BERT

import six
def convert_to_unicode(text):
    """
    Converts `text` to Unicode (if it's not already).
    Note: bytes are decoded with unicode_escape here, not UTF-8.
    """
    # six_ensure_text is copied from https://github.com/benjaminp/six
    def six_ensure_text(s, encoding="unicode_escape", errors="strict"):
        if isinstance(s, six.binary_type):
            print('true')  # debug marker: the bytes branch was taken
            return s.decode(encoding, errors)  # bytes: decode with the given codec
        elif isinstance(s, six.text_type):  # already text: return unchanged
            return s
        else:
            raise TypeError("not expecting type '%s'" % type(s))

    return six_ensure_text(text, encoding="unicode_escape", errors="ignore")

with open('a.txt', encoding='UTF-16') as f:
    text = f.read()
# Pass bytes so that the decoding branch runs; a str would be returned as-is
print(convert_to_unicode(text.encode("utf-8")))

true
1.新出吐鲁番文书及其研究


Note:

>>> type(text.encode("utf-8"))  # encode() returns a bytes object
<class 'bytes'>

>>> type(text)  # encoding= in open() names the file's encoding; text itself is a str
<class 'str'>

https://six.readthedocs.io/

The binary type above (six.binary_type) is simply the bytes type in Python 3.
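Since six.binary_type is bytes and six.text_type is str on Python 3, the same check can be written without six. The following is a minimal Python 3-only sketch; the name `ensure_text` is chosen for illustration, it keeps this post's unicode_escape default, and it is not the code that ships with BERT or six.

```python
def ensure_text(s, encoding="unicode_escape", errors="strict"):
    """Return `s` as str, decoding it first if it is bytes (Python 3 only)."""
    if isinstance(s, bytes):
        # bytes: decode with the requested codec
        return s.decode(encoding, errors)
    if isinstance(s, str):
        # already text: nothing to do
        return s
    raise TypeError("not expecting type '%s'" % type(s))


# Bytes that hold a literal \u65b0 escape are decoded to one character
print(ensure_text(b"1.\\u65b0"))
# A str is passed through unchanged, escapes included
print(ensure_text(r"1.\u65b0"))
```

The second call shows why the examples above call encode() first: a str never reaches the decoding branch, so its escapes stay as they are.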

Original article: https://www.cnblogs.com/BlueBlueSea/p/13516695.html