[Natural Language Processing with Python] 3.7 Regular Expressions for Tokenizing Text

Tokenization is the task of cutting a string into identifiable linguistic units that together make up a piece of language data.

Simple approaches to tokenization

>>> import re
>>> raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
... though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
... well without--Maybe it's always pepper that makes people hot-tempered,'..."""
>>> # The simplest method is to split the text on whitespace
>>> re.split(r'\s+', raw)
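To make the limits of whitespace splitting concrete, here is a minimal sketch on a shorter sample sentence (the sentence and the variable names are made up for illustration). It compares splitting on whitespace, splitting on non-word characters, and matching tokens directly with `re.findall`. The last pattern is just one possible choice, not a recommended tokenizer.

```python
import re

# Hypothetical sample sentence, shorter than the passage above.
sample = "'I won't have any pepper,' she said."

# Splitting on whitespace keeps punctuation glued to the words.
by_space = re.split(r'\s+', sample)
# ["'I", "won't", 'have', 'any', "pepper,'", 'she', 'said.']

# Splitting on runs of non-word characters drops the punctuation, but it
# also breaks "won't" in two and leaves empty strings at both ends.
by_nonword = re.split(r'\W+', sample)
# ['', 'I', 'won', 't', 'have', 'any', 'pepper', 'she', 'said', '']

# Matching tokens instead of separators: words with optional internal
# hyphens or apostrophes, or any single punctuation character.
by_findall = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sample)
# ["'", 'I', "won't", 'have', 'any', 'pepper', ',', "'", 'she', 'said', '.']

print(by_space)
print(by_nonword)
print(by_findall)
```

Describing what a token looks like, rather than what separates tokens, tends to be easier to extend, which is the approach the pattern-based tokenizer later in this section takes.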

To tokenize more effectively with regular expressions, we need a deeper understanding of regular expression syntax.

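Symbol  Function
\b      Word boundary (zero width)
\d      Any decimal digit (equivalent to [0-9])
\D      Any non-digit character (equivalent to [^0-9])
\s      Any whitespace character (equivalent to [ \t\n\r\f\v])
\S      Any non-whitespace character (equivalent to [^ \t\n\r\f\v])
\w      Any alphanumeric character (equivalent to [a-zA-Z0-9_])
\W      Any non-alphanumeric character (equivalent to [^a-zA-Z0-9_])
\t      The tab character
\n      The newline character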
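The following sketch tries several of these symbols on small made-up strings (the strings and variable names are illustrative only). One caveat worth noting: in Python 3 the classes `\w`, `\d` and `\s` also match non-ASCII Unicode characters by default, so the bracketed equivalents in the table hold exactly only for ASCII text or when the `re.ASCII` flag is set.

```python
import re

s = "Room 42\tcosts $12.40\n"

digits = re.findall(r'\d+', s)      # runs of digits
non_digits = re.findall(r'\D+', s)  # runs of everything else
chunks = re.findall(r'\S+', s)      # runs of non-whitespace
words = re.findall(r'\w+', s)       # runs of letters, digits, underscore

print(digits)      # ['42', '12', '40']
print(non_digits)  # ['Room ', '\tcosts $', '.', '\n']
print(chunks)      # ['Room', '42', 'costs', '$12.40']
print(words)       # ['Room', '42', 'costs', '12', '40']

# \b is zero width: it matches a position, not a character.
phrase = "the other theme, the end"
loose = re.findall(r'the', phrase)      # also hits "other" and "theme"
whole = re.findall(r'\bthe\b', phrase)  # only the standalone word
print(len(loose), len(whole))           # 4 2

# Unicode versus ASCII behaviour of \w in Python 3.
accented = 'na\u00efve caf\u00e9'
unicode_words = re.findall(r'\w+', accented)
ascii_words = re.findall(r'\w+', accented, flags=re.ASCII)
print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```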

NLTK's regular expression tokenizer

>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> # Groups are non-capturing, (?:...), so that whole matches are returned as tokens
>>> pattern = r'''(?x)       # set flag to allow verbose regexps
...     (?:[A-Z]\.)+         # abbreviations, e.g. U.S.A.
...   | \w+(?:-\w+)*         # words with optional internal hyphens
...   | \$?\d+(?:\.\d+)?%?   # currency and percentages, e.g. $12.40, 82%
...   | \.\.\.               # ellipsis
...   | [][.,;"'?():_`-]     # these are separate tokens; hyphen placed last so it is literal
... '''
>>> nltk.regexp_tokenize(text, pattern)
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
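The sketch below illustrates, using only the standard-library `re` module, why the kind of group matters in a token pattern. It assumes that the tokenizer collects matches in roughly the way `re.findall` does. That is an assumption made for illustration, not a description of NLTK's internals. When a pattern contains capturing groups, `re.findall` returns the group contents instead of the whole match, so the tokens are lost. Non-capturing groups, written `(?:...)`, avoid this. The variable names are made up for the example, and the punctuation alternative is left out to keep it short.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Same alternatives written with capturing groups.
capturing = r'''(?x)
    ([A-Z]\.)+           # abbreviations
  | \w+(-\w+)*           # words with optional internal hyphens
  | \$?\d+(\.\d+)?%?     # currency and percentages
  | \.\.\.               # ellipsis
'''

# Identical pattern, but every group is non-capturing.
non_capturing = r'''(?x)
    (?:[A-Z]\.)+
  | \w+(?:-\w+)*
  | \$?\d+(?:\.\d+)?%?
  | \.\.\.
'''

# With capturing groups, findall yields one tuple of group contents per match.
groups = re.findall(capturing, text)
print(groups[1])   # ('A.', '', '')  instead of 'U.S.A.'

# With non-capturing groups, findall yields the matched tokens themselves.
tokens = re.findall(non_capturing, text)
print(tokens)      # ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```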