Puneetha D S
5th Sem BCA, NCMS
Tesseract is an open-source text recognition (OCR) engine available under the Apache License 2.0. It is free software, originally developed by Hewlett-Packard, and its development has been sponsored by Google. Tesseract can extract text from an image directly from the command line or through an API. It can also recognize a wide variety of languages, and it is easy to install and use.
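As a rough illustration of the "directly" route, here is a minimal Python sketch that calls the tesseract command-line tool. It assumes the tesseract executable is installed and on the PATH. The helper names build_tesseract_command and extract_text, and the file name sample.png, are made up for this example and are not part of Tesseract itself.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_command(image_path: str, output_base: str = "stdout",
                            lang: Optional[str] = None) -> List[str]:
    """Build the argument list: tesseract <image> <outputbase> [-l <lang>]."""
    command = ["tesseract", image_path, output_base]
    if lang:
        # -l selects the language data to use, for example "eng" or "kan".
        command += ["-l", lang]
    return command


def extract_text(image_path: str, lang: Optional[str] = None) -> str:
    """Run tesseract on an image file and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    # Using "stdout" as the output base makes tesseract print the text
    # instead of writing it to a .txt file.
    result = subprocess.run(
        build_tesseract_command(image_path, "stdout", lang),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# "sample.png" is a placeholder name, not a file that comes with this post.
print(build_tesseract_command("sample.png", lang="eng"))
```

Calling extract_text("sample.png", lang="eng") would then return the recognized text as a string, provided the image exists and the English language data is installed.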
Tesseract is written in C and C++ and is available for Linux, Windows and macOS. It can be used from many programming languages and frameworks. It works its way up from the pixels of the image to letters, words and sentences. When we input an image, Tesseract first applies adaptive thresholding to convert it to a binary image. It then analyzes the connected components, finds the text lines, recognizes the words, and finally outputs the text recognized from the input image.
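To make the first two stages more concrete, here is a small pure-Python sketch of adaptive thresholding followed by connected component counting. It is only a toy version under simple assumptions (a local-mean threshold and 4-connectivity on a tiny grayscale grid) and is not how Tesseract itself implements these steps. The function names are made up for this example.

```python
from typing import List

Image = List[List[int]]


def adaptive_threshold(gray: Image, radius: int = 1, offset: int = 0) -> Image:
    """Mark a pixel as ink (1) when it is darker than its local mean minus offset."""
    height, width = len(gray), len(gray[0])
    binary = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood around (y, x), clipped at the borders.
            window = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            binary[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return binary


def count_components(binary: Image) -> int:
    """Count groups of ink pixels that touch horizontally or vertically."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            count += 1
            seen[y][x] = True
            stack = [(y, x)]
            while stack:  # flood fill from the first pixel of the component
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return count


# A tiny grayscale "image": a two-pixel vertical stroke and a single dot.
gray = [
    [200, 200, 200, 200, 200],
    [200,  20, 200,  30, 200],
    [200,  20, 200, 200, 200],
]
binary = adaptive_threshold(gray)
print(binary)
print(count_components(binary))
```

In a real OCR engine the components found this way would then be grouped into text lines and words before recognition. That part is left out of this sketch.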
The main pros of using Tesseract OCR are that training it is easy and that it can recognize many languages. The better the image quality (size, contrast, lighting), the better the recognition. No OCR software is 100 percent accurate; the number of errors depends mainly on the type of document and the quality of the image. The cons of using Tesseract are its recognition rate, which is not always high, and the fact that some text blocks were recognized more than once.
Accuracy for Indian languages:
Bengali: 98%
Gujarati: 97.21%
Hindi: 90%
Kannada: 33.3%
Oriya: 96.3%
Gurmukhi: 97%
Tamil: 99%
Telugu: 98.5%
Malayalam: 90.22%
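This post does not say how the accuracy figures above were measured. One common way to put a number on OCR quality is character accuracy: compare the recognized text with the correct text and count the insertions, deletions and substitutions needed to turn one into the other (the edit distance). The sketch below shows that calculation in Python. The function names are made up for this example, and it counts Unicode code points, which is a simplification for Indian scripts where one visible character can span several code points.

```python
def edit_distance(reference: str, recognized: str) -> int:
    """Levenshtein distance between the correct text and the OCR output."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from the OCR output
                current[j - 1] + 1,      # extra character in the OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, recognized: str) -> float:
    """Percentage of reference characters that were recognized correctly."""
    if not reference:
        return 100.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 100.0 * (1 - errors / len(reference)))


# Hypothetical example: the OCR output confuses the letter "l" with the digit "1".
print(round(character_accuracy("hello world", "hel1o world"), 2))
```

With one wrong character out of eleven, this example gives an accuracy of about 90.91 percent.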
Use one of the links below to download the Tesseract OCR installer for Windows (64-bit first, then 32-bit):
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0-alpha.20200328.exe
https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v5.0.0-alpha.20200328.exe