Python-字符编码

1、python程序中的字符编码

python解释器在加载.py文件中的代码时，会对内容进行编码（默认ascill）。

执行以下代码：

[root@localhost ~]# cat test01.py

#/usr/bin/python2
print("你好，世界！")

在python2.7上：

[root@localhost ~]# workon test01
(test01) [root@localhost ~]# python
Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
(test01) [root@localhost ~]# python test01.py 
  File "test01.py", line 2
SyntaxError: Non-ASCII character 'xe4' in file test01.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

在python3.7上：

[root@localhost ~]# workon test02
(test02) [root@localhost ~]# python
Python 3.7.7 (default, Mar 27 2020, 12:29:36) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
(test02) [root@localhost ~]# python test01.py 
你好，世界！

再次执行如下代码：

[root@localhost ~]# cat test02.py

#/usr/bin/python2.7
#-*- coding:utf-8 -*-
print("你好，世界！")

在python2.7上：

(test01) [root@localhost ~]# python
Python 2.7.5 (default, Oct 30 2018, 23:45:53) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
(test01) [root@localhost ~]# python test02.py 
你好，世界！

在python3.7上：

(test02) [root@localhost ~]# python
Python 3.7.7 (default, Mar 27 2020, 12:29:36) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> exit()
(test02) [root@localhost ~]# python test02.py 
你好，世界！

python2.x中默认的编码格式是 ASCII 格式，在没修改编码格式时无法正确打印汉字，所以在读取中文时会报错。

解决方法为只要在文件开头加入 # -*- coding: UTF-8 -*- 或者 #coding=utf-8 就行了。

python3.x中：

就是说，在python3.7上Cpython强制使用utf-8编码。

注意：

python3.X 源码文件默认使用utf-8编码，所以可以正常解析中文，无需指定 UTF-8 编码。

小结：

python3 : 默认底层编码是Unicode，encode之后将是bytes;

python2 : 底层编码是bytes；

2、字符编码介绍

ASCII (美国信息交换标准代码）是基于拉丁字母的一套电脑编码系统，主要用于显示现代英语和其他西欧语言。到目前为止共定义了128个字符。ASCII 码使用指定的7 位或8 位二进制数组合来表示128 或256 种可能的字符。

有128-255个字符的扩展。

为了处理汉字，程序员设计了用于简体中文的GB2312和用于繁体的big5。

GB2312-80 是 1980 年制定的中国汉字编码国家标准。共收录 7445 个字符，其中汉字 6763 个。GB2312 兼容标准 ASCII码，采用扩展 ASCII 码的编码空间进行编码，一个汉字占用两个字节，每个字节的最高位为1。

1995年汉字的扩展规模达到21886个符号，分别是汉字区和图形符号区,产生了GBK1.0。

2000年的GB18030取代了GBK1.0的正式国家标准。该标准收录了27484个汉字，同时还收录了藏文，蒙文，维吾尔文等重要的少数民族文字。
从ASCII、GB2312、GBK到GB18030，这些编码是向下兼容的。

显然ASCII码是无法将世界上的各种文字和符号全部表示，所以，就需要新出一种可以代表所有字符和符号的编码，即Unicode。

Unicode（统一码，万国码，单一码）是一种在计算机上使用的字符编码，Unicode是为了解决传统的字符编码方案的局限性而产生的，他为每种语言中的每个字符设定了统一并且唯一的二进制编码，规定所有的字符和符号最少由16位来表示（2个字节）。

UTF-8，是对Unicode编码的压缩和优化，他不在是最少使用2个字节了，而是将所有的字符和符号进行分类：ASCII码中的内容用1个字节保存，欧洲的字符用2个字节保存，东亚的字符用3个字节保存。

3、字符编码与解码

python 3中最重要的新特性可能就是将文本(text)和二进制数据做了更清晰的区分。文本总是用unicode进行编码，以str类型表示；而二进制数据以bytes类型表示。

在python3中，不能以任何隐式方式将str和bytes类型二者混合使用。不可以将str和bytes类型进行拼接，不能在str中搜索bytes数据(反之亦然)，也不能将str作为参数传入需要bytes类型参数的函数(反之亦然)。

字符串和字节符之间划分界线是必然的。如下为字符串和字节符之间的转换：

strings可以被编码(encode)成字bytes,bytes也可以解码(decode)成strings：

#-*- coding:utf-8 -*-
msg = "我们都是好孩子"   #此时变量msg为utf-8格式 
print(msg)
print(msg.encode(encoding="utf-8"))    #将变量msg转换为bytes
print(msg.encode(encoding="utf-8").decode(encoding="utf-8"))    #再将变量msg转换为string

输出结果为：

我们都是好孩子
b'xe6x88x91xe4xbbxacxe9x83xbdxe6x98xafxe5xa5xbdxe5xadxa9xe5xadx90'
我们都是好孩子

在python2.7中的转换：

# -*- coding:utf-8 -*-
a = "你好"      #此时编码为utf-8
a_to_unic = a.decode("utf-8")   #将utf-8,decode为unic,调用时必须声明自己为utf-8,否则被认为是系统默认编码格式bytes
a_to_gbk = a_to_unic.encode("gbk") #将unic,encode为gbk,调用时必须声明转换为什么格式
a_to_utf8 = a_to_gbk.decode("gbk").encode("utf-8") #将gbk直接转换为utf-8

print(a_to_unic)
print(a_to_gbk)
print(a_to_utf8)

在python3.7中的转换：

#-*- coding:gbk -*-
msg = "我们都是好孩子"
print(msg)
m_to_unic = msg.encode("gbk")
m_to_uft8 = msg.encode("utf8")
m_to_gb2312 = msg.encode("gb2312")
print(m_to_unic)
print(m_to_uft8)
print(m_to_gb2312)

m_to_unic = msg.encode("gbk").decode("gbk")
m_to_uft8 = msg.encode("utf8").decode("utf8")
m_to_gb2312 = msg.encode("gb2312").decode("gb2312")

输出结果为：

我们都是好孩子
b'xcexd2xc3xc7xb6xbcxcaxc7xbaxc3xbaxa2xd7xd3'
b'xe6x88x91xe4xbbxacxe9x83xbdxe6x98xafxe5xa5xbdxe5xadxa9xe5xadx90'
b'xcexd2xc3xc7xb6xbcxcaxc7xbaxc3xbaxa2xd7xd3'
我们都是好孩子
我们都是好孩子
我们都是好孩子