How to Write a Spelling Corrector

Reposted from: http://norvig.com/spell-correct.html

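When I first saw this article I was a little intimidated, but after reading it carefully and thinking it through, I found that this feature is fairly easy to implement, at least for now. Thanks also to the author for sharing this knowledge.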

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

The function correction(word) returns a likely spelling correction:

>>> correction('speling')
'spelling'

>>> correction('korrectud')
'corrected'
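
The code above is the author's. Its main job is to read a dictionary, or any other txt file containing a large number of words, and use it to detect and correct spelling errors automatically.
Although it looks sophisticated, it is not hard to implement. The author uses a number of Python tricks to keep the code short, but the core idea is simple: generate every plausible candidate word, check each one against the dictionary, and use its weight (number of occurrences) to pick the word the writer most likely meant.
Start with edits1, which handles the simplest kind of correction. Quoting the original article: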
 a simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter).
The function edits1 returns a set of all the edited strings (whether words or not) that can be made with one simple edit:
The author explains each part in great detail.
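
To make the four edit kinds concrete, here is a minimal sketch that reuses the comprehensions from edits1 but returns each kind separately. The name edits1_by_kind is a hypothetical helper introduced only for illustration; it is not part of the original article.

```python
def edits1_by_kind(word):
    """Same comprehensions as edits1, grouped by edit kind (hypothetical helper)."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    # Every way to cut the word into a left part L and a right part R.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        'deletes':    [L + R[1:] for L, R in splits if R],
        'transposes': [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1],
        'replaces':   [L + c + R[1:] for L, R in splits if R for c in letters],
        'inserts':    [L + c + R for L, R in splits for c in letters],
    }

kinds = edits1_by_kind('the')
print(kinds['deletes'])      # ['he', 'te', 'th']
print(kinds['transposes'])   # ['hte', 'teh']
print(len(kinds['replaces']), len(kinds['inserts']))  # 78 104
```

For a word of length n this gives n deletes, n-1 transposes, 26n replaces and 26(n+1) inserts before duplicates are removed, which is why edits1 wraps the result in a set.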
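Next, edits2 builds on edits1 to cover two edits (two mistakes in a single word is probably about the limit for most people). It uses two nested loops: for e1 in edits1(word) yields the strings one edit away, and for e2 in edits1(e1) yields the strings two edits away. Unlike edits1, it returns a generator expression rather than a set, so the same string can be produced more than once.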
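
The following sketch checks that behavior directly. It repeats edits1 and edits2 from the listing so that it runs on its own; the variable names gen and two_away are only illustrative.

```python
import types

def edits1(word):
    "All strings one simple edit away from `word` (as in the listing)."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings two edits away, produced lazily."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

gen = edits2('korrectud')
print(isinstance(gen, types.GeneratorType))  # True

# Materialize the generator once; this also removes duplicates.
two_away = set(gen)
print('corrected' in two_away)  # True: k -> c, then u -> e
```

Because known() iterates over its argument only once, the generator is enough for candidates and avoids building the full set unless it is needed.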
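The known function keeps only those strings that actually appear in the dictionary.
candidates tries, in order, known([word]), known(edits1(word)), known(edits2(word)) and finally [word], and returns the first of these that is non-empty. It does not merge them: if the word itself is known, the edits are never consulted.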
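
A small sketch of that priority order, using a toy dictionary and hand-written candidate lists in place of big.txt, edits1 and edits2. The counts and the helper name pick are assumptions made for illustration only.

```python
from collections import Counter

# Toy word counts standing in for the real WORDS built from big.txt.
WORDS = Counter({'spelling': 3, 'spewing': 1})

def known(words):
    "The subset of `words` that appear in WORDS."
    return set(w for w in words if w in WORDS)

def pick(word, one_away, two_away):
    "Same chain as candidates, with the edit lists passed in (hypothetical helper)."
    return known([word]) or known(one_away) or known(two_away) or [word]

print(pick('speling', ['spelling', 'spelin'], ['spewing']))  # {'spelling'}
print(pick('spewing', ['spelling'], []))                     # {'spewing'}
print(pick('zzz', ['zz'], ['z']))                            # ['zzz']
```

In the first call 'spewing' is never considered, because a known word one edit away already exists.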
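Note:
set or set is not the same as set | set: A or B evaluates to A if A is non-empty, and to B otherwise.
set and set is not the same as set & set: A and B evaluates to B if A is non-empty, and to A (the empty set) otherwise.
This is because an empty set is falsy in Python, and or/and return one of their operands unchanged rather than computing a union or intersection.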
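
The difference can be checked with a few literal sets; the values below are arbitrary examples.

```python
a = {'spelling'}
b = {'spelling', 'spewing'}
empty = set()

# `or` returns the first truthy operand unchanged.
print(a or b)              # {'spelling'}
print(sorted(empty or b))  # ['spelling', 'spewing']

# `|` is the real union.
print(sorted(a | b))       # ['spelling', 'spewing']

# `and` returns the second operand when the first is truthy.
print(sorted(a and b))     # ['spelling', 'spewing']
print(empty and b)         # set()

# `&` is the real intersection.
print(a & b)               # {'spelling'}

# A chain of `or` falls through to the last operand if all sets are empty.
print(empty or set() or ['word'])  # ['word']
```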
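Finally,
max(candidates(word), key=P)
is applied to the resulting candidates. P is the key function used for the comparison: for each candidate it returns WORDS[word] / sum(WORDS.values()).
The result is the candidate with the highest weight.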
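
As a rough illustration of that last step, the sketch below builds a toy Counter from a hand-written word list (an assumption; the article uses big.txt) and picks the most probable of three candidates.

```python
from collections import Counter

# Toy corpus: 'the' x3, 'then' x2, 'they' x1.
WORDS = Counter(['the', 'the', 'the', 'they', 'then', 'then'])
N = sum(WORDS.values())  # 6 words in total

def P(word):
    "Relative frequency of `word`; a Counter returns 0 for unseen words."
    return WORDS[word] / N

candidates = {'they', 'then', 'thee'}
print(P('then'), P('they'), P('thee'))  # 2/6, 1/6 and 0.0
print(max(candidates, key=P))           # then
```

Since every candidate is divided by the same N, ranking by P gives the same winner as ranking by the raw count.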
Original post: https://www.cnblogs.com/silencestorm/p/8557988.html