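[Python Study Notes-011] Handling HTML Special Characters

HTML has a large number of special characters. To encode and decode them in Python, use the standard library module html. For example: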

>>> import html
>>> s = ' " '
>>> html.escape(s)
' &quot; '
>>> html.unescape(' &quot; ')
' " '
>>>

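So, to encode special characters, use html.escape(); to decode them, use html.unescape(). Note that html.escape() only converts &, < and >, plus the quote characters " and ' by default, while html.unescape() handles every named and numeric character reference.

A handy little script (foo.py)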
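As a small illustration of the two functions, the sketch below escapes a string both with and without quote handling and then checks the round trip. The sample string and the variable names are made up for this example; only html.escape() (with its quote parameter) and html.unescape() come from the standard library.

```python
import html

# Hypothetical sample input containing all five characters escape() cares about
raw = '<a href="x?a=1&b=2">it\'s</a>'

# Default (quote=True): &, <, > and both quote characters are converted
escaped = html.escape(raw)

# quote=False: only &, < and > are converted; quotes are left alone
text_only = html.escape(raw, quote=False)

print(escaped)    # &lt;a href=&quot;x?a=1&amp;b=2&quot;&gt;it&#x27;s&lt;/a&gt;
print(text_only)  # &lt;a href="x?a=1&amp;b=2"&gt;it's&lt;/a&gt;

# Decoding the escaped text gives back the original string
print(html.unescape(escaped) == raw)  # True
```

The default is probably the safer choice when the result ends up inside an HTML attribute value; quote=False is enough for plain text between tags.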

#!/usr/bin/python3
""" Encode/Decode HTML special chars """

import sys
import html


def main(argc, argv):
    # Require exactly one operation flag (-e or -d) and one string argument
    if argc != 3 or argv[1] not in ('-e', '-d'):
        print("Usage: %s <-e|-d> <chars>" % argv[0], file=sys.stderr)
        return 1
    op = argv[1]
    chars = argv[2]
    if op == '-e':
        # Encode: special characters -> character references
        print(html.escape(chars))
    else:
        # Decode: character references -> the characters they stand for
        print(html.unescape(chars))
    return 0


if __name__ == '__main__':
    sys.exit(main(len(sys.argv), sys.argv))
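The same command-line interface could also be written with argparse, which generates the usage message and rejects unknown flags on its own. This is only a sketch of one possible variant, not part of foo.py: the helper names build_parser and convert, and the 'encode'/'decode' constants, are invented for this example.

```python
import argparse
import html


def build_parser():
    """Build a parser that accepts exactly one of -e / -d plus the text."""
    parser = argparse.ArgumentParser(
        description="Encode/Decode HTML special chars")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('-e', dest='op', action='store_const', const='encode',
                       help='encode special characters')
    group.add_argument('-d', dest='op', action='store_const', const='decode',
                       help='decode character references')
    parser.add_argument('chars', help='the text to convert')
    return parser


def convert(op, chars):
    """Apply html.escape() or html.unescape() depending on op."""
    if op == 'encode':
        return html.escape(chars)
    return html.unescape(chars)


def main(argv=None):
    # argv=None makes argparse read sys.argv[1:]
    args = build_parser().parse_args(argv)
    print(convert(args.op, args.chars))
    return 0
```

Calling main(['-d', '&quot;']) prints a double quote, and a script would typically end with sys.exit(main()) under the usual __main__ guard.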

Sample:

$ python3 foo.py -e '"'
&quot;
$ python3 foo.py -d '&quot;'
"

$ python3 foo.py -e "'"
&#x27;
$ python3 foo.py -d '&#x27;'
'

$ python3 foo.py -d '&sum;'
∑
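The last sample shows that decoding is not limited to the five characters html.escape() produces. As a rough sketch, the snippet below decodes the named, decimal, and hexadecimal references for the same character and looks the name up in html.entities.html5, the standard library table of HTML5 named references. The variable names are illustrative only.

```python
import html
from html.entities import html5

# Three spellings of the same character, U+2211 (N-ARY SUMMATION)
samples = ['&sum;', '&#8721;', '&#x2211;']
decoded = [html.unescape(s) for s in samples]
print(decoded)  # ['∑', '∑', '∑']

# The named-reference table is keyed by the name including the trailing ';'
print(html5['sum;'])  # ∑

# Encoding is not symmetric: escape() leaves non-ASCII characters untouched
print(html.escape('∑'))  # ∑
```

So a string that went through html.unescape() and then html.escape() will not necessarily come back in its original form.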

References:

https://dev.w3.org/html5/html-author/charref
html.escape() in Python
HTML Named character references