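Handling Chinese in Python regular expressions (repost)

When matching Chinese text, the regular expression pattern and the target string must use the same encoding: either both are unicode objects, or both are byte strings in the same encoding.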
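The rule can be illustrated with a minimal sketch. It assumes Python 3, where str is always Unicode text and bytes is the encoded form (the Python 2 snippets in this post use unicode and str for the same roles). The variable names are illustrative only.

```python
import re

text = "#who#helloworld#a中文x#"
utf8_bytes = text.encode("utf-8")

# Pattern and target both UTF-8: the byte sequences line up, so it matches.
same = re.search(re.escape("中文".encode("utf-8")), utf8_bytes)

# Pattern in GBK, target in UTF-8: same characters, different bytes, no match.
different = re.search(re.escape("中文".encode("gbk")), utf8_bytes)

# Both as Unicode text: the byte-level encoding no longer matters.
as_text = re.search("中文", text)
```

Decoding everything to Unicode text at the boundary and matching on text is usually the least error-prone option, which is what the unicode example later in this post does.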

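    # -*- coding: utf-8 -*-
    import sys
    print sys.getdefaultencoding()    # 'ascii' on a default Python 2 install
    text = u"#who#helloworld#a中文x#"
    print isinstance(text, unicode)   # True
    print text                        # fails on an ASCII-only console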

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 18: ordinal not in range(128)

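print text raises this error.
Explanation: the console output stream is encoded as ASCII (on an English-language system the default encoding is ascii), whereas the string in the code above is a unicode object, so the implicit conversion at output time fails.
Changing the call to print(text.encode('utf8')) fixes it.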
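To see the two failure modes side by side, here is a small sketch. It assumes Python 3, where the conversions have to be written explicitly instead of happening implicitly as in Python 2. The names ascii_ok, decode_ok and bad_byte are made up for the illustration.

```python
text = "#who#helloworld#a中文x#"

# Encoding to ASCII fails because 中 and 文 have no ASCII representation.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding to UTF-8 succeeds and yields bytes that any console can pass through.
utf8_bytes = text.encode("utf-8")

# The reverse direction fails the same way: decoding UTF-8 bytes as ASCII
# raises UnicodeDecodeError, the exception type shown in the traceback above.
try:
    utf8_bytes.decode("ascii")
    decode_ok = True
except UnicodeDecodeError as exc:
    decode_ok = False
    bad_byte = exc.object[exc.start]   # first byte that ASCII cannot decode
```

Here bad_byte is 0xe4, the lead byte of 中 in UTF-8, which is the same byte value reported in the error message.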

# Check the system default encoding
import sys
print sys.getdefaultencoding()
# 'ascii'

# Check whether a string is unicode
print isinstance(text, unicode)
# True

Converting between unicode and Python str

# -*- coding: utf-8 -*-
__author__ = 'medcl'
unistr = u'a'                      # unicode object
pystr = unistr.encode('utf8')      # unicode -> byte str (UTF-8)
unistr2 = unicode(pystr, 'utf8')   # byte str -> unicode

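# When a unicode object is required
# (value is a placeholder for whatever string arrives from elsewhere)
if not isinstance(value, unicode):
    temp = unicode(value, 'utf8')
else:
    temp = value

# When a Python byte str is required
if isinstance(value, unicode):
    temp2 = value.encode('utf8')
else:
    temp2 = value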
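The two branches above can be wrapped into small helpers. The following is a sketch only, assuming Python 3 (str is text, bytes is encoded data) and UTF-8 as the default encoding. The function names to_text and to_bytes are hypothetical, not from the original post.

```python
def to_text(value, encoding="utf-8"):
    """Return value as Unicode text, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as encoded bytes, encoding it first if it is text."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value
```

Calling to_text on every incoming string before it reaches a regex keeps the pattern and the target of the same kind.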

Extracting non-ASCII text with a regex

Input:
"#who#helloworld#a中文x#"

Regex:
r"[\x80-\xff]+"

Output:
中文
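A byte-level version of this can be sketched as follows, assuming Python 3 and UTF-8 input. The pattern is a bytes pattern, so the target has to be bytes as well; raw, runs and decoded are illustrative names.

```python
import re

raw = "#who#helloworld#a中文x#".encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so a run of such
# bytes corresponds to a run of non-ASCII characters.
runs = re.findall(rb"[\x80-\xff]+", raw)

# Decode each run back to text for display.
decoded = [run.decode("utf-8") for run in runs]
```

This approach finds any non-ASCII text, not only Chinese, and it relies on the input being UTF-8; other encodings lay out their bytes differently.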

# -*- coding: utf-8 -*-
__author__ = 'medcl'
import re

def findPart(regex, text, name):
    # Print every match of regex found in the unicode string text.
    res = re.findall(regex, text)
    if res:
        print "There are %d %s parts:\n" % (len(res), name)
        for r in res:
            print "\t", r.encode("utf8")
        print

text = "#who#helloworld#a中文x#"       # UTF-8 byte str
usample = unicode(text, 'utf8')       # decode so pattern and target are both unicode
findPart(u"#[\w\u2E80-\u9FFF]+#", usample, "unicode chinese")

Output:

	#who#
	#a中文x#
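For comparison, a rough Python 3 counterpart of the same match is sketched below. One assumption worth flagging: in Python 3, \w already matches Chinese characters in text patterns, so the sketch passes re.ASCII to keep \w limited to ASCII and make the explicit range do the work, as in the pattern above.

```python
import re

text = "#who#helloworld#a中文x#"

# re.ASCII restricts \w to [a-zA-Z0-9_]; the CJK range is listed explicitly.
parts = re.findall(r"#[\w\u2E80-\u9FFF]+#", text, flags=re.ASCII)

for part in parts:
    print("\t" + part)
```

Note that "helloworld" is not reported: findall returns non-overlapping matches, and the # that would open it was already consumed as the closing # of "#who#".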

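Main character ranges for several non-English scripts

2E80-33FFh: CJK symbols. Contains the Kangxi dictionary radicals, CJK supplementary radicals, Bopomofo (Zhuyin) symbols, Japanese kana, Korean jamo, CJK symbols and punctuation, circled or parenthesized letters, numbers and months, and Japanese kana combinations, units, era names, months, days, times, and so on.
3400-4DFFh: CJK Unified Ideographs Extension A. Contains 6,582 CJK ideographs in total.
4E00-9FFFh: CJK Unified Ideographs. Contains 20,902 CJK ideographs in total.
A000-A4FFh: Yi script. Contains the syllables and radicals of the Yi script of southern China.
AC00-D7FFh: Hangul syllables. Contains the characters composed from Korean jamo.
F900-FAFFh: CJK Compatibility Ideographs. Contains 302 CJK ideographs in total.
FB00-FFFDh: Presentation forms. Contains Latin ligatures, Hebrew, Arabic, CJK vertical punctuation, small form variants, and halfwidth and fullwidth forms.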
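These ranges can be turned into a simple lookup. The sketch below assumes Python 3 and copies the boundaries from the list above as-is; the short labels, the RANGES table and the classify function are illustrative names, not an official Unicode block list.

```python
# (start, end, label) triples taken from the ranges listed above.
RANGES = [
    (0x2E80, 0x33FF, "CJK symbols"),
    (0x3400, 0x4DFF, "CJK ideographs extension A"),
    (0x4E00, 0x9FFF, "CJK unified ideographs"),
    (0xA000, 0xA4FF, "Yi"),
    (0xAC00, 0xD7FF, "Hangul syllables"),
    (0xF900, 0xFAFF, "CJK compatibility ideographs"),
    (0xFB00, 0xFFFD, "presentation forms"),
]


def classify(ch):
    """Return the label of the range containing ch, or None if none matches."""
    code = ord(ch)
    for start, end, label in RANGES:
        if start <= code <= end:
            return label
    return None
```

For example, classify("中") falls in the 4E00-9FFF range, while an ASCII letter such as "a" is outside every range and yields None.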

REF:http://www.blogjava.net/Skynet/archive/2009/05/02/268628.html

http://iregex.org/blog/python-chinese-unicode-regular-expressions.html

Reposted from: Handling Chinese in Python regular expressions