I figured they and many others could benefit from an explanation. The full details of an industrial-strength spell corrector like Google's would be more confusing than enlightening, but I figured that on the plane flight home, in less than a page of code, I could write a toy spelling corrector that achieves 80 or 90% accuracy at a processing speed of at least 10 words per second.
So here, in 21 lines of Python 2.5 code, is the complete spelling corrector:
import re, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])
The code defines the function correct, which takes a word as input and returns a likely correction of that word. For example:
>>> correct('speling')
'spelling'
>>> correct('korrecter')
'corrector'
How does it work? First, a little theory. Given a word, we are trying to choose the most likely spelling correction for that word (the "correction" may be the original word itself). There is no way to know for sure (for example, should "lates" be corrected to "late" or "latest"?), which suggests we use probabilities. We will say that we are trying to find the correction c, out of all possible corrections, that maximizes the probability of c given the original word w:
argmax_c P(c|w)

By Bayes' Theorem this is equivalent to:

argmax_c P(w|c) P(c) / P(w)

Since P(w) is the same for every possible c, we can ignore it, giving:

argmax_c P(w|c) P(c)

There are three parts of this expression. From right to left, we have:

1. P(c), the probability that the proposed correction c stands on its own as a word of English. This is the language model: think "how likely is c to appear in an English text?"
2. P(w|c), the probability that w would be typed when the author meant c. This is the error model: think "how likely is the author to type w by mistake when c was intended?"
3. argmax_c, the control mechanism, which says to enumerate all feasible values of c and pick the one that maximizes the combined probability.
One obvious question is: why take a simple expression like P(c|w) and replace it with a more complex expression involving two models rather than one? The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly. Consider the misspelled word w="thew" and the two candidate corrections c="the" and c="thaw". Which has a higher P(c|w)? Well, "thaw" seems good because the only change is "a" to "e", which is a small change. On the other hand, "the" seems good because "the" is a very common word, and perhaps the typist's finger slipped off the "e" onto the "w". The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.
Now we are ready to show how the program works. First, P(c). We will read a big text file, big.txt, which consists of about a million words. The file is a concatenation of several public domain books from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. (On the plane all I had was a collection of Sherlock Holmes stories that happened to be on my laptop; I added the other sources later and stopped adding texts when they stopped helping, as we shall see in the Evaluation section.)
We then extract the individual words from the file (using the function words, which converts everything to lowercase, so that "the" and "The" will be the same, and then defines a word as a sequence of alphabetic characters, so "don't" will be seen as the two words "don" and "t"). Next we train a probability model, which is a fancy way of saying we count how many times each word occurs, using the function train. It looks like this:
def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(words(file('big.txt').read()))
At this point, NWORDS[w] holds a count of how many times the word w has been seen. There is one complication: novel words. What happens with a perfectly good word of English that wasn't seen in our training data? It would be bad form to say the probability of a word is zero just because we haven't seen it yet. There are several standard approaches to this problem; we take the easiest one, which is to treat novel words as if we had seen them once. This general process is called smoothing, because we are smoothing over the parts of the probability distribution that would have been zero, bumping them up to the smallest possible count. This is achieved through the class collections.defaultdict, which is like a regular Python dict (what other languages call hash tables) except that we can specify the default value of any key; here we use 1.
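To make the smoothing concrete, here is a small interactive sketch with a fresh counter (the words and counts are illustrative, not taken from big.txt):

>>> import collections
>>> model = collections.defaultdict(lambda: 1)
>>> model['the'] += 1     # a word seen once in the training text: its count starts at 1 and becomes 2
>>> model['the']
2
>>> model['quixotic']     # a novel word: never trained on, yet its count is not zero
1

Note that looking up a novel word also inserts it into the dict with its default count, a side effect that will matter later in the Evaluation section.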
Now let's look at the problem of enumerating the possible corrections c of a given word w. It is common to talk of the edit distance between two words: the number of edits it would take to turn one into the other. An edit can be a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter). Here's a function that returns a set of all words c that are one edit away from w:
def edits1(word):
    n = len(word)
    return set([word[0:i]+word[i+1:] for i in range(n)] +                     # deletion
               [word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + # transposition
               [word[0:i]+c+word[i+1:] for i in range(n) for c in alphabet] + # alteration
               [word[0:i]+c+word[i:] for i in range(n+1) for c in alphabet])  # insertion
This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates). For example, len(edits1('something')) -- that is, the number of elements in the result of edits1('something') -- is 494.
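A quick check of that arithmetic for the 9-letter word above (how many duplicates drop out depends on the word's repeated letters):

>>> n = len('something')           # 9
>>> n + (n-1) + 26*n + 26*(n+1)    # deletions + transpositions + alterations + insertions = 54n+25
511
>>> len(edits1('something'))       # 494 once duplicates collapse in the set
494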
The literature on spelling correction claims that 80 to 95% of spelling errors are an edit distance of 1 from the target. As we shall see shortly, I put together a development corpus of 270 spelling errors, and found that only 76% of them have edit distance 1. Perhaps the examples I found are harder than typical errors. Anyway, I thought this was not good enough, so we'll need to consider edit distance 2. That's easy: just apply edits1 to all the results of edits1:
def edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1))
This is easy to write, but we're starting to get into some serious computation: len(edits2('something')) is 114,324. However, we do get good coverage: of the 270 test cases, only 3 have an edit distance greater than 2. That is, edits2 will cover 98.9% of the cases; that's good enough for me. Since we aren't going beyond edit distance 2, we can do a small optimization: only keep the candidates that are actually known words. We still have to consider all the possibilities, but we don't have to build up a big set of them. The function known_edits2 does this:
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
Now, for example, known_edits2('something') is a set of just 3 words: {'smoothing', 'something', 'soothing'}, rather than the set of 114,324 words generated by edits2. That speeds things up by about 10%.
Now the only part left is the error model, P(w|c). Here's where I ran into difficulty. Sitting on the plane, with no internet connection, I was stymied: I had no training data to build a model of spelling errors. I had some intuitions: mistaking one vowel for another is more probable than mistaking two consonants; making an error on the first letter of a word is less probable, etc. But I had no numbers to back that up. So I took a shortcut: I defined a trivial model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. By "known word" I mean a word that we have seen in the language model training data -- a word in the dictionary. We can implement this strategy as follows:
def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])
The function correct chooses as the set of candidate words the set with the shortest edit distance to the original word, as long as that set has some known words. Once it identifies the candidate set to consider, it chooses the element with the highest P(c) value, as estimated by the NWORDS model.
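Two Python details do the work here: an empty set is falsy, so each or clause is consulted only when every earlier (shorter edit distance) clause produced no known words; and max with a key function picks the candidate with the largest NWORDS count. A tiny illustrative sketch of both (the counts are made up, not taken from big.txt):

>>> set() or set(['spelling']) or ['speling']   # falls through past the empty set
set(['spelling'])
>>> counts = {'the': 80030, 'thaw': 13, 'thew': 3}
>>> max(['thaw', 'the', 'thew'], key=lambda w: counts[w])
'the'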
Now for the evaluation. I put together two test sets of spelling errors, each mapping a target word to one or more misspellings of it: tests1, a development set of 270 errors, and tests2, a final test set of 400 errors, along with a test harness:

tests1 = { 'access': 'acess', 'accessing': 'accesing', 'accommodation':
    'accomodation acommodation acomodation', 'account': 'acount', ...}

tests2 = {'forbidden': 'forbiden', 'decisions': 'deciscions descisions',
    'supposedly': 'supposidly', 'embellishing': 'embelishing', ...}
def spelltest(tests, bias=None, verbose=False):
    import time
    n, bad, unknown, start = 0, 0, 0, time.clock()
    if bias:
        for target in tests: NWORDS[target] += bias
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            w = correct(wrong)
            if w != target:
                bad += 1
                unknown += (target not in NWORDS)
                if verbose:
                    print '%r => %r (%d); expected %r (%d)' % (
                        wrong, w, NWORDS[w], target, NWORDS[target])
    return dict(bad=bad, n=n, bias=bias, pct=int(100. - 100.*bad/n),
                unknown=unknown, secs=int(time.clock()-start) )
print spelltest(tests1)
print spelltest(tests2) ## only do this after everything is debugged

This gives the following output:
{'bad': 68, 'bias': None, 'unknown': 15, 'secs': 16, 'pct': 74, 'n': 270}
{'bad': 130, 'bias': None, 'unknown': 43, 'secs': 26, 'pct': 67, 'n': 400}

So on the development set of 270 cases, we get 74% correct in 16 seconds (a rate of about 17 Hz), and on the final test set of 400 cases we get 67% correct in 26 seconds (about 15 Hz).
Update: In the original version of this essay I incorrectly reported a higher score on both test sets, due to a bug. The bug was subtle, but I should have caught it, and I apologize for misleading those who read the earlier version. In the original version of spelltest, I left out the if bias: in the fourth line of the function (and the default value was bias=0, not bias=None). I figured that when bias = 0, the statement NWORDS[target] += bias would have no effect. In fact it does not change the value of NWORDS[target], but it does have an effect: it makes (target in NWORDS) true. So in effect the spelltest routine was cheating by making all the unknown words known. This was a humbling error, and I have to admit that much as I like defaultdict for the brevity it adds to programs, I think I would not have had this bug if I had used regular dicts.
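To see the behaviour that caused the bug, here is a minimal sketch with a fresh defaultdict (the word is just an example):

>>> import collections
>>> d = collections.defaultdict(lambda: 1)   # a fresh model, like NWORDS
>>> 'econometric' in d
False
>>> d['econometric'] += 0    # adding a zero bias still performs a lookup...
>>> 'econometric' in d       # ...and the lookup inserts the key, so the word is now "known"
True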
In conclusion, I met my goals for brevity, development time, and runtime speed, but not for accuracy. So let's look at what could be improved, starting with the language model P(c). The most serious problem is unknown words: of the errors, 15 in the development set and 43 in the final test set involved a target word never seen in training (the 'unknown' counts in the output above). Here is some verbose output from spelltest showing such cases:
correct('economtric') => 'economic' (121); expected 'econometric' (1)
correct('embaras') => 'embargo' (8); expected 'embarrass' (1)
correct('colate') => 'coat' (173); expected 'collate' (1)
correct('orentated') => 'orentated' (1); expected 'orientated' (1)
correct('unequivocaly') => 'unequivocal' (2); expected 'unequivocally' (1)
correct('generataed') => 'generate' (2); expected 'generated' (1)
correct('guidlines') => 'guideline' (2); expected 'guidelines' (1)
In this output we show the call to correct and the result (with the NWORDS count for the result in parentheses), and then the word expected by the test set (again with the count in parentheses). What this shows is that if you don't know that 'econometric' is a word, you're not going to be able to correct 'economtric'. We could mitigate by adding more text to the training corpus, but then we also add words that might turn out to be the wrong answer. Note the last four lines above are inflections of words that do appear in the dictionary in other forms. So we might want a model that says it is okay to add '-ed' to a verb or '-s' to a noun.
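A minimal sketch of that idea: accept a candidate if stripping a common inflectional suffix leaves a known word. The suffix list and the function name are mine, not part of the program above, and whether this predicate (used in place of known) helps more than it hurts would have to be measured on the test sets:

SUFFIXES = ['s', 'es', 'd', 'ed', 'ly', 'ing']

def known_or_inflected(words):
    # A word counts as acceptable if it is in NWORDS, or if removing a common
    # suffix leaves something in NWORDS (e.g. 'guidelines' -> 'guideline' + 's').
    return set(w for w in words
               if w in NWORDS
               or any(w.endswith(suf) and w[:-len(suf)] in NWORDS for suf in SUFFIXES))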
The second potential source of error in the language model is bad probabilities: two words appear in the dictionary, but the wrong one appears more frequently. I must say that I couldn't find cases where this is the only fault; other problems seem much more serious.
We can simulate how much better we might do with a better language model by cheating on the tests: pretending that we have seen the correctly spelled word 1, 10, or more times. This simulates having more text (and just the right text) in the language model. The function spelltest has a parameter, bias, that does this. Here's what happens on the development and final test sets when we add more bias to the correctly-spelled words:
| Bias | Dev. | Test |
|---|---|---|
| 0 | 74% | 67% |
| 1 | 74% | 70% |
| 10 | 76% | 73% |
| 100 | 82% | 77% |
| 1000 | 89% | 80% |
On both test sets we get significant gains, approaching 80-90%. This suggests that with a good enough language model we might reach our accuracy goal. On the other hand, this is probably optimistic: as we build a bigger language model we would also introduce words that are the wrong answer, which this cheating method does not do.
Another way to deal with unknown words is to allow the result of correct to be a word we have not seen. For example, if the input is "electroencephalographicallz", a good correction would be to change the final "z" to a "y", even though "electroencephalographically" is not in our dictionary. We could achieve this with a language model based on components of words: perhaps on syllables or suffixes (such as "-ally"), but it is far easier to base it on sequences of characters: 2-, 3- and 4-letter sequences.
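Here is a minimal sketch of the character-sequence idea, scoring a candidate by its character trigrams. The helper names are mine, reusing train is just a convenience, and a real model would normalize for word length and smooth more carefully:

import math

def char_ngrams(word, n=3):
    padded = '^' + word + '$'    # mark the word boundaries
    return [padded[i:i+n] for i in range(len(padded) - n + 1)]

# Count trigrams over the same training text used for NWORDS.
NGRAMS = train(ng for w in words(file('big.txt').read()) for ng in char_ngrams(w))

def char_score(word):
    # Frequent trigrams ('all', 'lly') add to the score; trigrams never seen in
    # training contribute nothing, so implausible letter sequences score relatively low.
    return sum(math.log(NGRAMS[ng]) for ng in char_ngrams(word))

Such a score could rank candidates like 'electroencephalographically' that NWORDS has never seen.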
A more serious problem is the crude error model P(w|c). Here are cases where correct settles for a known word that is a single edit away instead of the intended correction, which requires more edits:

correct('reciet') => 'recite' (5); expected 'receipt' (14)
correct('adres') => 'acres' (37); expected 'address' (77)
correct('rember') => 'member' (51); expected 'remember' (162)
correct('juse') => 'just' (768); expected 'juice' (6)
correct('accesing') => 'acceding' (2); expected 'assessing' (1)
Here, for example, the alteration of 'd' to 'c' to get from 'adres' to 'acres' should count more than the sum of the two changes 'd' to 'dd' and 's' to 'ss'.
Also, some cases where we choose the wrong word at the same edit distance:
correct('thay') => 'that' (12513); expected 'they' (4939)
correct('cleark') => 'clear' (234); expected 'clerk' (26)
correct('wer') => 'her' (5285); expected 'were' (4290)
correct('bonas') => 'bones' (263); expected 'bonus' (3)
correct('plesent') => 'present' (330); expected 'pleasant' (97)
The same type of lesson holds: in 'thay', changing an 'a' to an 'e' should count as a smaller change than changing a 'y' to a 't'. How much smaller? It must be at least a factor of 2.5 (the ratio of the counts above, 12513/4939) to overcome the prior probability advantage of 'that' over 'they'.
Clearly we could use a better model of the cost of edits. We could use our intuition to assign lower costs for doubling letters and changing a vowel to another vowel (as compared to an arbitrary letter change), but it seems better to gather data: to get a corpus of spelling errors, and count how likely it is to make each insertion, deletion, or alteration, given the surrounding characters. We need a lot of data to do this well. If we want to look at the change of one character for another, given a window of two characters on each side, that's 26^6, which is over 300 million. You'd want several examples of each, on average, so we need at least a billion characters of correction data; probably safer with at least 10 billion.
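As a minimal sketch of what such a data-driven error model could look like, here is a counter trained from hypothetical (misspelling, correction) pairs. The table name is mine, and the edit extraction is deliberately simplified to single-character alterations with no surrounding context:

EDIT_COUNTS = collections.defaultdict(lambda: 1)   # smoothed counts of observed alterations

def record_alterations(wrong, right):
    # Record which intended letter was typed as which wrong letter (same-length pairs only).
    if len(wrong) == len(right):
        for intended, typed in zip(right, wrong):
            if intended != typed:
                EDIT_COUNTS[(intended, typed)] += 1

def P_alteration(intended, typed):
    # Crude relative likelihood of typing `typed` when `intended` was meant.
    return float(EDIT_COUNTS[(intended, typed)]) / sum(EDIT_COUNTS.itervalues())

With enough such counts, correct could rank candidates by P(c) times the probability of the edits linking c to w, rather than by the rigid distance-1-beats-distance-2 rule.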
Note there is a connection between the language model and the error model. The current program has such a simple error model (all edit distance 1 words before any edit distance 2 words) that it handicaps the language model: we are afraid to add obscure words to the model, because if one of those obscure words happens to be edit distance 1 from an input word, then it will be chosen, even if there is a very common word at edit distance 2. With a better error model we can be more aggressive about adding obscure words to the dictionary. Here are some examples where the presence of obscure words in the dictionary hurts us:
correct('wonted') => 'wonted' (2); expected 'wanted' (214)
correct('planed') => 'planed' (2); expected 'planned' (16)
correct('forth') => 'forth' (83); expected 'fourth' (79)
correct('et') => 'et' (20); expected 'set' (325)
A further limitation is the enumeration of candidates itself, which stops at edit distance 2. Here (listed as target, then misspelling) are test-set errors that lie more than two edits from their targets, and so can never be found by the current program:

purple perpul
curtains courtens
minutes muinets
successful sucssuful
hierarchy heiarky
profession preffeson
weighted wagted
inefficient ineffiect
availability avaiblity
thermawear thermawhere
nature natior
dissension desention
unnecessarily unessasarily
disappointing dissapoiting
acquaintances aquantences
thoughts thorts
criticism citisum
immediately imidatly
necessary necasery
necessary nessasary
necessary nessisary
unnecessary unessessay
night nite
minutes muiuets
assessing accesing
necessitates nessisitates
We could consider extending the model by allowing a limited set of edits at edit distance 3. For example, allowing only the insertion of a vowel next to another vowel, the replacement of one vowel with another, or the replacement of a close consonant (like 'c' for 's') would handle almost all these cases.
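A minimal sketch of such a restriction, reusing edits1 and allowing only vowel edits as the third step (the function names are mine, the close-consonant rule is omitted, and as written this would be far too slow without further pruning):

VOWELS = 'aeiou'

def vowel_edits(word):
    # Only vowel-for-vowel replacements and vowel insertions next to an existing vowel.
    n = len(word)
    replacements = set(word[:i] + v + word[i+1:]
                       for i in range(n) if word[i] in VOWELS for v in VOWELS)
    insertions = set(word[:i] + v + word[i:]
                     for i in range(n + 1) for v in VOWELS
                     if (i > 0 and word[i-1] in VOWELS) or (i < n and word[i] in VOWELS))
    return replacements | insertions

def known_edits3(word):
    # Distance-2 edits followed by one restricted vowel edit, keeping only known words.
    return set(e3 for e1 in edits1(word) for e2 in edits1(e1)
                  for e3 in vowel_edits(e2) if e3 in NWORDS)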
A different kind of failure: sometimes the misspelling is itself a dictionary word, so correct returns it unchanged even though another word was intended:

correct('where') => 'where' (123); expected 'were' (452)
correct('latter') => 'latter' (11); expected 'later' (116)
correct('advice') => 'advice' (64); expected 'advise' (20)
We can't possibly know that correct('where') should be 'were' in at least one case, but should remain 'where' in other cases. But if the query had been correct('They where going') then it seems likely that "where" should be corrected to "were".
The context of the surrounding words can also help when there is an obvious error but two or more good candidate corrections. Consider:
correct('hown') => 'how' (1316); expected 'shown' (114)
correct('ther') => 'the' (81031); expected 'their' (3956)
correct('quies') => 'quiet' (119); expected 'queries' (1)
correct('natior') => 'nation' (170); expected 'nature' (171)
correct('thear') => 'their' (3956); expected 'there' (4973)
correct('carrers') => 'carriers' (7); expected 'careers' (2)
Why should 'thear' be corrected as 'there' rather than 'their'? It is difficult to tell by the single word alone, but if the query were correct("There's no there thear") it would be clear.
To build a model that looks at multiple words at a time, we will need a lot of data. Fortunately, Google has released a database of word counts for all sequences up to five words long, gathered from a corpus of a trillion words.
I believe that a spelling corrector that scores 90% accuracy will need to use the context of the surrounding words to make a choice. But we'll leave that for another day...
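Still, here is a minimal sketch of what a context-aware chooser might look like, assuming a table of word-pair (bigram) counts such as could be built from that data; the table name, its construction, and the scoring are mine, not part of the program above:

# Hypothetical bigram counts: BIGRAMS[('they', 'were')] = times the pair was seen in training.
BIGRAMS = collections.defaultdict(lambda: 1)

def correct_in_context(prev_word, word):
    # Unlike correct, keep the word's one-edit neighbours as candidates even when the word
    # itself is known, so that real-word errors ('where' for 'were') can be caught; then
    # rank by bigram count with the preceding word, breaking ties with the unigram count.
    candidates = (known([word]) | known(edits1(word))) or known_edits2(word) or [word]
    return max(candidates, key=lambda c: (BIGRAMS[(prev_word, c)], NWORDS[c]))

With realistic counts, correct_in_context('they', 'where') should favour 'were', because the pair ('they', 'were') is far more common than ('they', 'where').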
Another issue is that the test data itself is imperfect; in a few cases the program's answer looks at least as good as the 'expected' one:

correct('aranging') => 'arranging' (20); expected 'arrangeing' (1)
correct('sumarys') => 'summary' (17); expected 'summarys' (1)
correct('aurgument') => 'argument' (33); expected 'auguments' (1)
We could also decide what dialect we are trying to train for. The following three errors are due to confusion about American versus British spelling (our training data contains both):
correct('humor') => 'humor' (17); expected 'humour' (5)
correct('oranisation') => 'organisation' (8); expected 'organization' (43)
correct('oranised') => 'organised' (11); expected 'organized' (70)
Thanks to Jay Liang for pointing out there are only 54n+25 distance 1 edits, not 55n+25 as I originally wrote.
Thanks to Dmitriy Ryaboy for pointing out there was a problem with unknown words; this allowed me to find the NWORDS[target] += bias bug.