pytesseract

简单使用
百度文字识别

介绍

Tesseract-OCR 是一款由HP实验室开发由Google维护的开源OCR（Optical Character Recognition , 光学字符识别）引擎。与Microsoft Office Document Imaging（MODI）相比，我们可以不断的训练的库，使图像转换文本的能力不断增强；如果团队深度需要，还可以以它为模板，开发出符合自身需求的OCR引擎。

安装

pip install pytesseract
windows 上要安装 tesseract-ocr-setup-4.00.00dev.exe 程序，然后设置环境变量

使用文档

要在指定路径下安装程序

简单使用

image_url = 'https://dss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=2437330463,3653504620&fm=26&gp=0.jpg'
image_body = requests.get(image_url).content
# 使用 Image.open 打开图片字节流，得到图片对象
image_stream = Image.open(io.BytesIO(image_body))
# 使用光学识别从图片对象中读取文字并打印输出结果
print(pytesseract.image_to_string(image_stream))

import io
import requests
from urllib.parse import urljoin
from parsel import Selector
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

url = 'http://www.porters.vip/confusion/recruit.html'
resp = requests.get(url)
sel = Selector(resp.text)

# 从响应正文中提取图片名称
image_name = sel.css('.pn::attr("src")').extract_first()
print(image_name)

# 拼接图片名和 URL
image_url = urljoin(url, image_name)
print(image_url)

# 请求图片，拿到图片的字节流内容
image_body = requests.get(image_url).content

# 使用 Image.open 打开图片字节流，得到图片对象
image_stream = Image.open(io.BytesIO(image_body))

# 使用光学识别从图片对象中读取文字并打印输出结果
print(pytesseract.image_to_string(image_stream))

当然，白嫖的识别率没有那么高

那么我们可以使用百度的付费 OCR 工具

百度文字识别

官方文档

使用

from aip import AipOcr
url = 'https://ss2.bdstatic.com/70cFvnSh_Q1YnxGkpoWK1HF6hhy/it/u=1403079776,3446215732&fm=26&gp=0.jpg'
""" 你的 APPID AK SK """
APP_ID = '*******'
API_KEY = 'sSGOb5PTw85V5eMbuGdCfTsM'
SECRET_KEY = 'KFs9LZu*****qooh5qCw4yndu1o5OGGz'

client = AipOcr(APP_ID, API_KEY, SECRET_KEY)

res = client.basicGeneralUrl(url)
print(res)

# 识别结果：{'log_id': 5472146157050550897, 'words_result_num': 4, 'words_result': [{'words': '努力吧,直到你的账'}, {'words': '户余额看起来像电话'}, {'words': '号码。'}, {'words': 'S6GNET'}]}