10.tesseract

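1.Tesseract-OCR overview

Tesseract is an open-source OCR (optical character recognition) engine backed by Google. It supports many languages (I use version 3.02, which supports English, Simplified Chinese, and Traditional Chinese, among others) and runs on Windows, Linux, and Mac OS X.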

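2.Installing Tesseract

Download the Windows installer for Tesseract. The build I downloaded is the one maintained at http://3.onj.me/tesseract/. After installation there is a doc folder that contains the English documentation. To make the tesseract command available from any directory, add the installation directory to the PATH environment variable. For example, if Tesseract is installed in D:\Application\tesseract, add D:\Application\tesseract to PATH.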
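A quick way to confirm that the PATH change took effect is to look the executable up from Python. This is only a minimal sketch: the helper name find_executable is made up for illustration, and the extra directory passed in the usage part is just the example install path from above.

```python
import shutil


def find_executable(name, extra_dirs=()):
    """Return the full path of an executable, or None if it cannot be found."""
    # First search the directories listed in the PATH environment variable.
    found = shutil.which(name)
    if found:
        return found
    # Then search any extra directories (hypothetical install locations).
    for directory in extra_dirs:
        found = shutil.which(name, path=directory)
        if found:
            return found
    return None


if __name__ == "__main__":
    # The directory below is the example install path; adjust it as needed.
    path = find_executable("tesseract", [r"D:\Application\tesseract"])
    print(path or "tesseract was not found; check the PATH setting")
```

If the lookup only succeeds through the extra directory, the PATH entry is probably missing or the terminal was opened before the variable was changed.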

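Note:
The tessdata directory holds the language data files, along with the files that correspond to parameters you may pass on the command line. The installer includes the English language data by default.
To recognize other languages, download the corresponding language data files from https://github.com/tesseract-ocr/tessdata, choosing files that match your Tesseract version.
Once the download finishes, move the file into the tessdata directory.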

Add a new environment variable named TESSDATA_PREFIX and set its value to D:\Application\tesseract
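To see which languages are actually available after copying files into tessdata, the directory can be scanned for .traineddata files. The sketch below assumes that TESSDATA_PREFIX points at the folder that contains tessdata, as configured above; the function name installed_languages is illustrative and not part of any Tesseract API.

```python
import os


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file."""
    codes = []
    for name in os.listdir(tessdata_dir):
        base, ext = os.path.splitext(name)
        # Language data files are named <code>.traineddata, e.g. chi_sim.traineddata
        if ext == ".traineddata":
            codes.append(base)
    return sorted(codes)


if __name__ == "__main__":
    # Fall back to the example install path if the variable is not set.
    prefix = os.environ.get("TESSDATA_PREFIX", r"D:\Application\tesseract")
    tessdata = os.path.join(prefix, "tessdata")
    if os.path.isdir(tessdata):
        print(installed_languages(tessdata))
    else:
        print("tessdata directory not found:", tessdata)
```

A code that appears in this list (for example chi_sim) is one that can be passed to the -l option or to the lang argument of pytesseract.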

3.Using Tesseract

a. tesseract C:'Userppzc1.jpg result (English by default)

b. tesseract C:'Userppzc2.jpg result -l chi_sim (Simplified Chinese)
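The same two commands can also be driven from a script. The following is a rough sketch using only the standard library; the function names build_command and run_ocr are hypothetical, and it assumes the tesseract executable is on PATH and writes its text output to the output base name plus ".txt".

```python
import subprocess


def build_command(image_path, output_base, lang=None):
    """Build the argument list for: tesseract IMAGE OUTPUT_BASE [-l LANG]."""
    cmd = ["tesseract", image_path, output_base]
    if lang:
        # Without -l, Tesseract falls back to English.
        cmd += ["-l", lang]
    return cmd


def run_ocr(image_path, output_base, lang=None):
    """Run tesseract and return the recognized text."""
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    # Tesseract appends .txt to the output base name.
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()


if __name__ == "__main__":
    # Prints the command only; call run_ocr() once an image is available.
    print(build_command("2.jpg", "result", lang="chi_sim"))
```

Passing the arguments as a list avoids quoting problems when the image path contains spaces.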

4.Basic usage from Python

a.Recognizing Chinese

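import pytesseract
from PIL import Image

# Path to the Tesseract executable; adjust it to your installation
pytesseract.pytesseract.tesseract_cmd = r"D:\tesseract\tesseract.exe"
imgs = Image.open("1.png")
# lang="chi_sim" selects the Simplified Chinese language data
text1 = pytesseract.image_to_string(imgs, lang="chi_sim")
print(text1)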

b.Using the default language (English)

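import pytesseract
from PIL import Image

# Path to the Tesseract executable; adjust it to your installation
pytesseract.pytesseract.tesseract_cmd = r"D:\tesseract\tesseract.exe"
imgs = Image.open("2.jpg")
# No lang argument, so the default (English) is used
text1 = pytesseract.image_to_string(imgs)
print(text1)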

5.Example

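import pytesseract
from urllib import request
from PIL import Image
import time


def main():
    # Path to the Tesseract executable; adjust it to your installation
    pytesseract.pytesseract.tesseract_cmd = r"D:\tesseract\tesseract.exe"
    url = "https://passport.lagou.com/vcode/create?from=register&refresh=1513082291955"
    # Runs until interrupted (Ctrl+C): download a captcha, recognize it, wait
    while True:
        # Save the captcha image locally, overwriting the previous one
        request.urlretrieve(url, "1.png")
        image = Image.open("1.png")
        text = pytesseract.image_to_string(image)
        print(text)
        # Pause between requests
        time.sleep(2)


if __name__ == "__main__":
    main()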
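The URL in the example carries a fixed refresh value. That value looks like a millisecond timestamp, so it may be meant to change on every request; this is an assumption, not something the site documents here. Under that assumption, a small helper (with_fresh_refresh is a made-up name) could rebuild the URL before each download:

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_fresh_refresh(url, now_ms=None):
    """Return url with its 'refresh' query parameter set to now_ms.

    now_ms defaults to the current time in milliseconds. Treating 'refresh'
    as a cache-busting timestamp is an assumption about the site.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    parts = urlsplit(url)
    # Keep every query parameter except the old refresh value.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "refresh"]
    query.append(("refresh", str(now_ms)))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    base = "https://passport.lagou.com/vcode/create?from=register&refresh=1513082291955"
    print(with_fresh_refresh(base))
```

Inside the loop, the call request.urlretrieve(with_fresh_refresh(url), "1.png") would then request a new image each time instead of reusing the same query string.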