How to use the tools provided to train Tesseract for a new language

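Reposted from http://hi.baidu.com/romeroad/blog/item/4aec7d4a2fc69a2808f7ef58.html

Training Tesseract

Introduction

Tesseract 2.0 is fully trainable. This article describes the training process, gives some guidelines that apply to any language, and explains what to expect from the results.

Background and limitations

Tesseract was originally written to recognize English. Changes to the training system and the recognition engine now allow it to handle other languages and UTF-8 characters. Tesseract 2.0 can process any Unicode characters (coded with UTF-8), but there are limits on the range of languages it handles successfully. Take the following into account before you start training your language, or you are likely to be disappointed.

Tesseract can only handle left-to-right languages. A right-to-left language will still produce output, but the output file is ordered as if the text were left-to-right. Top-to-bottom languages will currently be hopeless.

Tesseract cannot currently handle Arabic.

With a large character set such as Chinese, Tesseract is likely to be too slow to be useful. Code changes are also needed to handle languages with more than 256 characters.

The core algorithms assume ASCII, so they may not work for languages that use their own punctuation and digits.

Data files required

To train for another language, you need to create 8 files in the tessdata subdirectory. The naming convention is languagecode.file_name. Language codes follow the ISO 639-3 standard. The 8 files used for English are: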

tessdata/eng.freq-dawg
tessdata/eng.word-dawg
tessdata/eng.user-words
tessdata/eng.inttemp
tessdata/eng.normproto
tessdata/eng.pffmtable
tessdata/eng.unicharset
tessdata/eng.DangAmbigs
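
As a quick sanity check on the naming convention, the following sketch reports which of the 8 files are missing for a given language code. The function name check_tessdata and its arguments are hypothetical helpers for illustration, not part of Tesseract; the file names are the ones listed above.

```shell
#!/bin/sh
# check_tessdata LANG [DIR]
# Hypothetical helper: prints one line per missing data file and returns
# the number of missing files (0 means all 8 are present).
check_tessdata() {
    lang="$1"
    dir="${2:-tessdata}"    # assumed default: ./tessdata
    missing=0
    for name in freq-dawg word-dawg user-words inttemp normproto \
                pffmtable unicharset DangAmbigs; do
        if [ ! -e "$dir/$lang.$name" ]; then
            echo "missing: $dir/$lang.$name"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

For example, calling check_tessdata eng from the directory that contains tessdata prints nothing when the English data set is complete.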

How little can you get away with?

You must create the four files inttemp, normproto, pffmtable and unicharset by following the steps below. If you only need a few fonts, a single training page might be enough. DangAmbigs and user-words can be empty files. The dictionary files freq-dawg and word-dawg do not need many words, but accuracy will be lower than with a decent-sized dictionary (tens of thousands of words for English, say).

Detailed training steps

1. Collect the CAPTCHA images. Binarize all of them, remove the noise, merge them into a single image with Photoshop, and convert that image to TIFF format, e.g. scan.tif.

2. Generate the box file. Run "tesseract scan.tif scan batch.nochop makebox". This produces the text file scan.txt. Correct any wrongly recognized characters, then rename scan.txt to scan.box. (bbtesseract can be used for this step instead; download it from http://code.google.com/p/bbtesseract/downloads/list)

3. Start training Tesseract. Run "tesseract scan.tif junk nobatch box.train". This produces the file scan.tr.

4. Clustering. Run "mftraining scan.tr". This produces the files inttemp, pffmtable and Microfeat (not used). Run "cntraining scan.tr". This produces the file normproto.

5. Compute the character set. Run "unicharset_extractor scan.box". This produces the file unicharset.

6. Dictionary data. This step can be skipped by copying the dictionary files from another language. Otherwise, create two UTF-8 text files, frequent_words_list and words_list, neither of which should contain duplicate words. Run "wordlist2dawg frequent_words_list freq-dawg" and "wordlist2dawg words_list word-dawg". This produces the two files freq-dawg and word-dawg.

7. Putting it all together. Collect all 8 files and rename them with a lang. prefix. The files eng.DangAmbigs and eng.user-words can be empty. If you do create eng.DangAmbigs, every character in it must also appear in scan.box.

8. Try it. Run "tesseract scan.tif output -l eng". The result is written to output.txt.

Quick steps

1. Prepare scan.tif as in step 1 of the detailed steps.

2. Generate and correct scan.box as in step 2 of the detailed steps (or use bbtesseract).

3. Copy all the files in Tesseract's training folder into the directory that contains tesseract.exe, then create a batch file in that directory with the following commands and run it:

   tesseract scan.tif junk nobatch box.train
   mftraining scan.tr
   cntraining scan.tr
   unicharset_extractor scan.box

   Of the files this generates, inttemp, normproto, pffmtable and unicharset are the ones you need.
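
For a Unix-like shell, the training run and the final renaming could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names run_training and install_lang are hypothetical, the tools are assumed to be on PATH, and scan.tif plus a corrected scan.box are assumed to be in the current directory. The dictionary files are copied only if they already exist, since the text allows taking them from another language.

```shell
#!/bin/sh
# Hypothetical wrapper around steps 3 to 5: run the training tools on one page.
# BASE is the shared file stem, e.g. "scan" for scan.tif and scan.box.
run_training() {
    base="$1"
    tesseract "$base.tif" junk nobatch box.train &&
    mftraining "$base.tr" &&
    cntraining "$base.tr" &&
    unicharset_extractor "$base.box"
}

# Hypothetical helper for step 7: copy the generated files into DEST as
# LANG.<name>, and create empty DangAmbigs and user-words files if none
# exist there yet.
install_lang() {
    lang="$1"
    dest="${2:-tessdata}"    # assumed default: ./tessdata
    mkdir -p "$dest" || return 1
    for name in inttemp normproto pffmtable unicharset freq-dawg word-dawg; do
        if [ -f "$name" ]; then
            cp "$name" "$dest/$lang.$name" || return 1
        else
            echo "warning: $name not found, skipping" >&2
        fi
    done
    for name in DangAmbigs user-words; do
        [ -e "$dest/$lang.$name" ] || : > "$dest/$lang.$name"
    done
}
```

With these definitions, "run_training scan && install_lang eng" would leave the renamed files in tessdata, after which "tesseract scan.tif output -l eng" can be used to try the result. Copying rather than moving keeps the originals in place in case the run needs to be repeated under a different language code.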