Background

The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Little work was done on it between 1995 and 2006, but it is probably still one of the most accurate open source OCR engines available. The engine reads a binary, grey or color image and outputs text. Image input is handled by the Leptonica Image Processing Library, which can read a wide variety of image formats.
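As a rough illustration of the image-in, text-out flow described above, the sketch below drives the tesseract command-line program from Python. It assumes the tesseract executable is installed and on the PATH and that the requested language data is present. The helper names build_command and run_ocr are hypothetical and are not part of the project.

```python
import subprocess


def build_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognized text.

    Assumption: the `tesseract` executable is on the PATH. The program
    writes its result to <output_base>.txt, which is read back here.
    """
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

With a hypothetical input file page.tif, calling run_ocr("page.tif", "page") would leave the recognized text in page.txt and return it as a string.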

Important Download Information

The language data files are separate from the code!

See the ReadMe wiki for installation and usage information!

Additional installation and usage information can be found in the FAQ wiki.

Important License Note

All of the code is licensed under the Apache 2.0 License EXCEPT tesseractTrainer.py, which is licensed under the GPL.

Supported Platforms

The developers regularly test on the following platforms:

Ubuntu 10.04 (x86/32, x86/64)
Windows (x86/32) with Visual C++ Express 2008/2010

Additionally, we believe the code should run on these other platforms, but we don't have the resources to test on them regularly:

Recent Linux distributions (x86/32, x86/64)
Mac OS X (x86, PPC)

People have reported success with Cygwin on Windows, but this is not a tested platform.

If you're interested in supporting other platforms or languages, please get in touch with Ray Smith.

Roadmap

The 3.01 release is now available for download and contains many new features (see the ReleaseNotes for a full list). The most notable new features are:

New languages: Arabic, Hindi, Thai.
Thread safety.
New PageIterator and ResultIterator APIs for extracting detailed recognition information.

Please read the ReadMe before going to Downloads, as you need more than one file. Even the Windows executables tarball is incomplete on its own, because language files are also required.
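Because the language files are downloaded separately, it can help to confirm they are in place before running the engine. The sketch below is a minimal check, assuming the language data for a given language code lives in a file named <lang>.traineddata inside a tessdata directory under some install prefix. That layout, and the helper names language_data_path and missing_languages, are assumptions for illustration; the ReadMe is the authority on where the files belong.

```python
import os


def language_data_path(lang, prefix):
    """Return the assumed location of <lang>.traineddata under prefix/tessdata."""
    return os.path.join(prefix, "tessdata", lang + ".traineddata")


def missing_languages(langs, prefix):
    """Return the language codes whose data file is not present on disk."""
    return [lang for lang in langs
            if not os.path.isfile(language_data_path(lang, prefix))]
```

For example, missing_languages(["eng", "ara"], prefix) returns an empty list only when both data files exist under the given prefix.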

The upcoming 3.02 release will probably include:

Hebrew with BiDi support.
More languages.

Core Developers

The core developer on the project is Ray Smith (theraysmith).

Thomas Breuel (tmbdev) and Ilya Mezhirov (mezhirov) work on the OCRopus project, for which Tesseract is one of the pluggable OCR engines; OCRopus also provides layout analysis and statistical language modeling.

Most of the work on Tesseract is sponsored by Google.