PRAGYAN NCMS SCIENCE FORUM: Extract text from Images/PDF with Tesseract OCR

Puneetha D S
5th Sem BCA, NCMS

Do you want to convert an image or PDF into text? Then you need an application that can recognize text using OCR (optical character recognition). OCR uses artificial intelligence to find the text in an image and recognize it.

Tesseract is an open source text recognition engine available under the Apache License 2.0. It is free software, originally developed by Hewlett-Packard, and its development has since been funded by Google. Tesseract can extract text from an image directly from the command line or through its API. It can also recognize a wide variety of languages, and it is easy to install and use.
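As a minimal sketch of the direct (command-line) route, the Python snippet below builds and runs a `tesseract` command. It assumes the `tesseract` executable is installed and on the PATH; the helper names `build_tesseract_command` and `ocr_image` are made up for this illustration and are not part of Tesseract.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base="stdout", lang="eng"):
    """Build the argument list for the tesseract command-line tool.

    Using "stdout" as the output base makes tesseract print the
    recognized text instead of writing <output_base>.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, "stdout", lang),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Calling `ocr_image("sample.png", lang="kan")` would then return the text of a Kannada page, where `sample.png` is a placeholder file name and the matching language data (`kan`) has to be installed alongside Tesseract.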

Tesseract is written in C and C++ and is available for Linux, Windows, and macOS. It can be used from many programming languages and frameworks. It works upward from the pixels of the image to letters, words, and sentences. Given an input image, it first applies adaptive thresholding to convert the image to a binary image, then analyzes the connected components, finds the text lines, recognizes the words, and finally outputs the text recognized in the input image.
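The first two steps of that pipeline can be illustrated with a small pure-Python sketch. This is not Tesseract's actual implementation: it assumes a simple local-mean threshold and 4-connected components on a tiny grayscale grid, and the function names and parameters (`window`, `offset`) are chosen only for this example.

```python
def adaptive_threshold(gray, window=3, offset=0):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes 1 (ink) when it is darker than the mean of its
    local window minus `offset`, otherwise 0 (background).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    binary = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbourhood, clipped at the image border.
            neighbours = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(1 if gray[y][x] < local_mean - offset else 0)
        binary.append(row)
    return binary


def count_components(binary):
    """Count 4-connected groups of ink pixels (value 1)."""
    height, width = len(binary), len(binary[0])
    seen = set()
    count = 0
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or (y, x) in seen:
                continue
            # New component found: flood-fill it with an explicit stack.
            count += 1
            seen.add((y, x))
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and binary[ny][nx] == 1 and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return count
```

Because the threshold is computed per neighbourhood rather than once for the whole image, a dark stroke can still be separated from the background when the page is unevenly lit, which is the usual reason for preferring an adaptive threshold over a single global one.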

The main advantages of Tesseract OCR are that training it is easy and that it can recognize many languages. The better the image quality (size, contrast, lighting), the better the recognition. No OCR software is 100 percent accurate; the number of errors depends mainly on the type of document and the quality of the image. The drawbacks of Tesseract are its recognition rate and the fact that some text blocks are recognized more than once.

Accuracy for Indian languages
Bengali - 98%
Gujarati - 97.21%
Hindi - 90%
Kannada - 33.3%
Oriya - 96.3%
Gurmukhi - 97%
Tamil - 99%
Telugu - 98.5%
Malayalam - 90.22%
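
The article does not say how these percentages were measured. One common way to score OCR output is character-level accuracy based on edit distance against a hand-typed reference text; the sketch below shows that calculation under this assumption, with function names chosen for illustration.

```python
def edit_distance(reference, recognized):
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn `recognized` into `reference`."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            cost = 0 if ref_char == rec_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from OCR output
                current[j - 1] + 1,      # extra character in OCR output
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, recognized):
    """Percentage of reference characters recognized correctly."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, recognized)
    return max(0.0, 100.0 * (1 - errors / len(reference)))
```

For example, `character_accuracy("abcd", "abed")` gives 75.0, since one of four characters is wrong. Note that Python counts Unicode code points, so in Indic scripts a vowel sign or other combining mark counts as its own character.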

Use one of the links below to download the Tesseract OCR installer for Windows (64-bit or 32-bit):

https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w64-setup-v5.0.0-alpha.20200328.exe

https://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-w32-setup-v5.0.0-alpha.20200328.exe

Wednesday, November 11, 2020
