Wednesday, November 11, 2020

Extract text from Images/PDF with Tesseract OCR

 Puneetha D S
5th Sem BCA, NCMS


Do you want to convert an image or a PDF into text? Then you will need an application that can recognize text using
OCR (optical character recognition). OCR uses artificial intelligence techniques to find the text in an image and recognize it.

Tesseract is an open-source text recognition engine available under the Apache License 2.0. It is free software, originally developed by Hewlett-Packard and later funded by Google. Tesseract can extract text from an image directly (from the command line) or through its API. It can also recognize a wide variety of languages, and it is easy to install and use.
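As a rough illustration of calling Tesseract directly, the sketch below wraps the `tesseract` command-line tool from Python. It assumes Tesseract is installed and on the PATH; the function names and the file name in the usage note are made up for this example. Passing `stdout` as the output base makes Tesseract print the recognized text instead of writing a file, and `-l` selects the language.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build the argument list for the tesseract command-line tool.

    "stdout" as the output base tells tesseract to print the text
    rather than write an output file; "-l" picks the language data.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def extract_text(image_path, lang="eng"):
    """Run tesseract on an image and return the recognized text.

    Assumes the tesseract binary and the requested language data
    are installed on this machine.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

For example, `extract_text("sample.png", lang="kan")` would try to read Kannada text from a hypothetical file named `sample.png`, provided the Kannada language data is installed.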

Tesseract is written in C and C++ and is available for Linux, Windows, and macOS. It is compatible with many programming languages and frameworks. It works its way up from pixels to letters, words, and sentences. Given an input image, it first applies adaptive thresholding to convert it to a binary image, then analyzes the connected components, finds the text lines, recognizes the characters, and finally outputs the text recognized from the input image.
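To give a feel for the thresholding step, here is a simplified sketch of adaptive thresholding in plain Python. It is not Tesseract's internal implementation; it only shows the idea of comparing each pixel with the average brightness of its neighbourhood. The function name, the window size, and the offset parameter are assumptions made for this example.

```python
def adaptive_threshold(gray, window=3, offset=0):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    Each pixel is compared with the mean of its window x window
    neighbourhood (clipped at the image borders). Pixels darker than
    (local mean - offset) become 1 (ink); all others become 0.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    binary = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - half), min(height, y + half + 1))
            xs = range(max(0, x - half), min(width, x + half + 1))
            values = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(values) / len(values)
            row.append(1 if gray[y][x] < local_mean - offset else 0)
        binary.append(row)
    return binary
```

Because the threshold is computed locally, a dark stroke still stands out when one part of the page is brighter than another, which a single global threshold can miss.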

The main advantages of Tesseract OCR are that training it is easy and that it can recognize many languages. The better the image quality (size, contrast, lighting), the better the recognition. No OCR software is 100 percent accurate; the number of errors depends mainly on the type of document and the quality of the image. The drawbacks of Tesseract are its recognition rate, which can be low for some inputs, and the fact that some text blocks are recognized more than once.

Accuracy for Indian languages
Bengali - 98%
Gujarati - 97.21%
Hindi - 90%
Kannada - 33.3%
Oriya - 96.3%
Gurmukhi - 97%
Tamil - 99%
Telugu - 98.5%
Malayalam - 90.22%
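
The post does not say how these figures were measured. One common way to score OCR output is character accuracy: compare the recognized text with the correct text and count how many single-character edits separate them. The sketch below is an assumption about how such a percentage could be computed, not the method behind the numbers above. It counts Unicode code points, which for Indian scripts is not always the same as counting visible characters.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of insertions,
    deletions, and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference, recognized):
    """Percentage of reference characters recognized correctly,
    based on edit distance and clamped at 0."""
    if not reference:
        return 100.0 if not recognized else 0.0
    errors = edit_distance(reference, recognized)
    return max(0.0, 100.0 * (1 - errors / len(reference)))
```

With this measure, a five-letter word with one wrong letter scores 80 percent.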


