How to Write a Spelling Corrector

Reposted from: http://norvig.com/spell-correct.html

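When I first saw this article I was a little intimidated, but after reading it carefully and thinking it through, I found that the feature is actually fairly easy to implement, at least at this level. My thanks to the author for sharing the knowledge.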

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

The function correction(word) returns a likely spelling correction:

>>> correction('speling')
'spelling'

>>> correction('korrectud')
'corrected'
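Above is the author's code. Its main job is to read a dictionary, or any other text file containing a large vocabulary, and use it to detect and correct spelling mistakes automatically.
It looks sophisticated, but it is not hard to implement. The author uses a number of Python idioms to keep the code short, but the core idea is simple: generate the candidate words that could be the intended spelling, check each one against the dictionary, and use a weight (the number of occurrences) to pick the word the writer most likely meant.
Start with edits1, which handles the simplest kind of correction. Quoting the original article: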
 a simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter). 
The function edits1 returns a set of all the edited strings (whether words or not) that can be made with one simple edit:
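To make the four edit types concrete, here is a small sketch that repeats the edits1 definition from the listing and checks one example of each edit. The input word 'somthing' and the four target words are chosen only for illustration.

```python
def edits1(word):
    "All edits that are one edit away from `word` (same definition as in the listing)."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

one_away = edits1('somthing')

print('something' in one_away)  # insertion of 'e'
print('somting' in one_away)    # deletion of 'h'
print('smothing' in one_away)   # transposition of 'o' and 'm'
print('soothing' in one_away)   # replacement of 'm' with 'o'

# A word of length n produces at most 54n + 25 raw strings:
# n deletes, n-1 transposes, 26n replaces and 26(n+1) inserts.
# The set removes the duplicates among them.
n = len('somthing')
print(len(one_away), '<=', 54 * n + 25)
```

Most of the strings in the result are not real words, which is why the later steps filter them against the dictionary.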
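The author explains each part in great detail.
Next is edits2, which builds on edits1 to cover two edits (misspelling a word in two places is probably about the limit for most people). It uses two loops: for e1 in edits1(word) yields the results of one edit, and for e2 in edits1(e1) yields the results of a second edit. Unlike edits1, it returns a generator expression rather than a set.
The known function keeps only the words from its input that appear in the dictionary, and returns them as a set.
candidates applies known to the original word, to the edits1 results, and to the edits2 results, in that order, and returns the first non-empty set. If none of them contains a known word, it falls back to a list holding just the original word.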
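The priority order inside candidates is easier to see with a toy vocabulary. The sketch below assumes a tiny hand-made WORDS counter (the three words and their counts are invented for illustration and are not taken from big.txt) and otherwise repeats the definitions from the listing.

```python
from collections import Counter

# Hypothetical miniature vocabulary; the counts are made up.
WORDS = Counter({'spelling': 5, 'spewing': 2, 'the': 100})

def known(words):
    "The subset of `words` that appear in WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All strings one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings two edits away from `word` (a generator, not a set)."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    "First non-empty result wins: the word, then one edit, then two edits."
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

print(candidates('the'))              # known as-is, so no edits are tried
print(sorted(candidates('speling')))  # known words one edit away
print(sorted(candidates('spelin')))   # nothing at one edit, so two edits are tried
print(candidates('zzzzzz'))           # nothing known: falls back to [word]
```

Because `or` stops at the first non-empty set, a known word one edit away always hides every word two edits away, however frequent the latter is.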
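Note:
set or set is not the same as set | set
set and set is not the same as set & set
or returns its first truthy operand (an empty set counts as false), and and returns its first falsy operand, or the last operand if both are truthy; | and & compute the union and the intersection. This is why candidates returns only the highest-priority non-empty set rather than a merged one.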
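As a quick check of how these operators behave on sets, here is a minimal sketch using made-up sets. It relies only on Python's built-in semantics for `or`, `and`, `|` and `&`.

```python
a = {'spelling'}
b = {'spewing', 'sapling'}
empty = set()

# `or` returns the first truthy operand (an empty set is falsy).
assert (a or b) == a
assert (empty or b) == b

# `and` returns the first falsy operand, or the last operand if all are truthy.
assert (a and b) == b
assert (empty and b) == empty

# `|` and `&` are the actual set operations: union and intersection.
assert (a | b) == {'spelling', 'spewing', 'sapling'}
assert (a & b) == set()

print("all checks passed")
```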
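Finally,

max(candidates(word), key=P)
is applied to the resulting set. P is the key function used for the comparison: for each word in the set it returns WORDS[word] / sum(WORDS.values()).
The result is the word with the highest weight.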
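The selection step can be tried in isolation. The sketch below assumes a small invented WORDS counter in place of the one built from big.txt, and a hand-written candidate set (named candidate_set here, a hypothetical stand-in for the output of candidates).

```python
from collections import Counter

# Hypothetical counts standing in for the real WORDS built from big.txt.
WORDS = Counter({'spelling': 5, 'spewing': 2, 'the': 100})

def P(word, N=sum(WORDS.values())):
    "Probability of `word`: its count divided by the total count of all words."
    return WORDS[word] / N

# Stand-in for the set that candidates('speling') would return.
candidate_set = {'spelling', 'spewing'}

# max with key=P compares the candidates by probability, not alphabetically.
best = max(candidate_set, key=P)
print(best, P(best))

# A Counter returns 0 for missing keys, so unknown words get probability 0.
print(P('speling'))
```

Note that N is computed once, when P is defined, so in this sketch WORDS has to be fully populated before the def statement runs.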