Text Normalization

2. Text Normalization

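Before any further analysis or NLP, we first need to normalize the corpus of text documents. To do that, we reuse the normalization module once more and add a few new techniques tailored to this content.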

After analyzing many corpora, we hand-picked some new words and added them to the stopword list, as the following code shows:

import nltk

# Requires the NLTK stopwords corpus: nltk.download('stopwords')
stopword_list = nltk.corpus.stopwords.words('english')
# Extend the default English list with generic, low-information words
stopword_list = stopword_list + ['mr', 'mrs', 'come', 'go', 'get',
                                 'tell', 'listen', 'one', 'two', 'three',
                                 'four', 'five', 'six', 'seven', 'eight',
                                 'nine', 'zero', 'join', 'find', 'make',
                                 'say', 'ask', 'see', 'try', 'back',
                                 'also']
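To make the effect of the extra entries concrete, here is a minimal sketch that filters a sentence against the custom additions only. It is an illustration, not the module's own remove_stopwords function: the name remove_custom_stopwords is hypothetical, str.split() stands in for the real tokenizer, and NLTK's default English stopwords are left out so the example runs without downloading any corpus.

```python
# Only the custom additions from the list above; NLTK's default English
# stopwords are deliberately omitted so this runs without any downloads.
CUSTOM_STOPWORDS = {
    'mr', 'mrs', 'come', 'go', 'get', 'tell', 'listen', 'one', 'two',
    'three', 'four', 'five', 'six', 'seven', 'eight', 'nine', 'zero',
    'join', 'find', 'make', 'say', 'ask', 'see', 'try', 'back', 'also',
}

def remove_custom_stopwords(text, stopwords=CUSTOM_STOPWORDS):
    """Hypothetical helper: drop tokens that appear in the stopword set."""
    # Whitespace splitting is an assumption; the real pipeline uses a tokenizer.
    tokens = text.lower().split()
    return ' '.join(token for token in tokens if token not in stopwords)

sample = "Mr Smith also say two detectives find one clue"
print(remove_custom_stopwords(sample))  # smith detectives clue
```

Only the content-bearing words survive, which is the behavior we want before extracting features for clustering.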

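Most of the newly added words are generic verbs or nouns that carry little meaning. Adding them to the stopword list is very useful for feature extraction in text clustering. We also add a new function to the normalization pipeline; it uses a regular expression to keep only the textual tokens in the body of the text, as shown below: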

import re

def keep_text_characters(text):
    # Keep only tokens that contain at least one ASCII letter;
    # purely numeric or punctuation-only tokens are dropped.
    filtered_tokens = []
    tokens = tokenize_text(text)
    for token in tokens:
        if re.search(r'[a-zA-Z]', token):
            filtered_tokens.append(token)
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
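The following self-contained sketch shows what this filter keeps and drops. The tokenize_text defined here is only a whitespace-splitting stand-in (an assumption) for the tokenizer from the normalization module, and the function body is condensed into a generator expression.

```python
import re

def tokenize_text(text):
    # Stand-in (assumption): the real module uses a proper word tokenizer.
    return text.split()

def keep_text_characters(text):
    # A token survives if it contains at least one ASCII letter.
    tokens = tokenize_text(text)
    return ' '.join(token for token in tokens if re.search(r'[a-zA-Z]', token))

print(keep_text_characters("released 2014 rated pg13 8 10 stars"))
# released rated pg13 stars
```

Note that a mixed token such as pg13 is kept, because the test is "contains a letter" and not "consists only of letters".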

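We add the new function to the final normalization function, together with the functions used repeatedly earlier (expanding contractions, unescaping HTML, tokenization, removing stopwords and special characters, and lemmatization), as follows: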

import html

def normalize_corpus(corpus, lemmatize=True,
                     only_text_chars=False,
                     tokenize=False):

    normalized_corpus = []
    for text in corpus:
        # html.unescape replaces HTMLParser.unescape, which was removed in Python 3.9
        text = html.unescape(text)
        text = expand_contractions(text, CONTRACTION_MAP)
        if lemmatize:
            text = lemmatize_text(text)
        else:
            text = text.lower()
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        if only_text_chars:
            text = keep_text_characters(text)
        # Optionally return each document as a list of tokens
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
    return normalized_corpus

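As you can see, this function closely resembles the one discussed earlier. The only addition is the keep_text_characters() function, which keeps textual tokens and runs when the only_text_chars parameter is set to True.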
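To illustrate how the only_text_chars and tokenize flags change the output, here is a reduced, self-contained sketch. It is not the normalize_corpus function above: simple_normalize is a hypothetical name, the stopword set is a tiny stand-in, contraction expansion and lemmatization are skipped, and whitespace splitting replaces the real tokenizer.

```python
import html
import re

# Tiny stand-in stopword set (assumption), not the full list from the module.
STOPWORDS = {'the', 'a', 'is', 'of', 'in', 'and', 'also', 'one'}

def simple_normalize(corpus, only_text_chars=False, tokenize=False):
    """Hypothetical, simplified stand-in for the normalization pipeline."""
    normalized = []
    for text in corpus:
        # Unescape HTML entities and lowercase (no lemmatization here)
        text = html.unescape(text).lower()
        # Replace anything that is not a letter, digit, or whitespace
        text = re.sub(r'[^a-z0-9\s]', ' ', text)
        tokens = [t for t in text.split() if t not in STOPWORDS]
        if only_text_chars:
            # Same rule as keep_text_characters: need at least one letter
            tokens = [t for t in tokens if re.search(r'[a-z]', t)]
        normalized.append(tokens if tokenize else ' '.join(tokens))
    return normalized

docs = ["The Godfather (1972) is one of the best &amp; also a classic!"]
print(simple_normalize(docs))
# ['godfather 1972 best classic']
print(simple_normalize(docs, only_text_chars=True))
# ['godfather best classic']
print(simple_normalize(docs, only_text_chars=True, tokenize=True))
# [['godfather', 'best', 'classic']]
```

With only_text_chars=True the year disappears, and with tokenize=True each document comes back as a list of tokens instead of a single string.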