Optical Character Recognition (OCR) is an old but still challenging problem that involves detecting and recognizing text in unstructured data, including images and PDF documents. It has cool applications in banking, e-commerce and content moderation on social media.
But as with every topic in data science, there is an overwhelming amount of resources to sift through when you try to learn how to solve the OCR task. This is why I am writing this tutorial, which can help you get started.
In this article, I am going to show some Python libraries that let you quickly extract text from images without too much struggle. The explanation of each library is followed by a practical example. The dataset used is taken from Kaggle. To keep things simple, I am just using an image of the film Rush.
Let’s get started!
pytesseract is one of the most popular Python libraries for optical character recognition. It uses Google’s Tesseract-OCR Engine to extract text from images. Multiple languages are supported, and the Tesseract documentation lists them, so you can check whether yours is covered. You just need a few lines of code to convert the image into text:
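!sudo apt install tesseract-ocr
!pip install pytesseract
import pytesseract
from pytesseract import Output
from PIL import Image
# Path to one image from the Kaggle dataset
img_path1 = '00b5b88720f35a22.jpg'
# Open the image and run Tesseract with the English language model
text = pytesseract.image_to_string(Image.open(img_path1), lang='eng')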
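Since the `lang` argument only works for language data that is actually installed, it can help to check this before running OCR. The snippet below is a minimal sketch, not part of the original example: `missing_languages` and `installed_languages` are hypothetical helper names, and the sketch assumes a pytesseract version that provides `get_languages`. Tesseract lets you combine several languages with `+` (for example `eng+ita`), so the helper splits on that character.

```python
def missing_languages(lang_spec, installed):
    """Return the codes in a Tesseract lang string (e.g. 'eng+ita')
    that do not appear in the list of installed languages."""
    requested = [code for code in lang_spec.split("+") if code]
    return [code for code in requested if code not in installed]


def installed_languages():
    """Ask the local Tesseract installation which languages it has."""
    # Imported here so the pure helper above works even without Tesseract.
    import pytesseract
    return pytesseract.get_languages(config="")
```

Calling `missing_languages('eng+ita', installed_languages())` returns an empty list when both language packs are present, and otherwise the codes you still need to install. The check is optional. For the sample image, the `image_to_string` call above runs unchanged with the `eng` model.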
This is the output: