
Transform PDFs or Images into Editable Text with OCR using Python 🐍

Free OCR tool Sep 19, 2024

Introduction


This article explains a simple method to read any PDF or image using Python's pytesseract module, along with a few other tools to convert PDFs into images. If you're already planning to work with images, that's great! Otherwise, we'll first programmatically convert PDF pages into images, and then extract readable text from them.

Steps

1. Install Python if you haven't already.

For Linux machines

sudo apt install python3

# Verify the installation

python3 --version

For Windows machines

Download the installer from python.org, the official home of the Python programming language.

2. Let's create a virtual environment. If you're not familiar with virtualenv, don't worry. Essentially, it creates an isolated environment where you can install Python packages specifically for a project. Without it, packages are usually installed globally on your machine, making them available to all your projects. Using virtualenv helps prevent conflicts between projects, so it's a good practice to set it up before starting any project.

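To create and activate one, follow these steps:

i)  Run pip3 install virtualenv.

ii) Then, use python3 -m virtualenv yourvirtualenvname to create your environment.

iii) Activate it with source yourvirtualenvname/bin/activate on Linux, or yourvirtualenvname\Scripts\activate on Windows. Packages you install afterwards go into this environment only.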
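
If you're not sure whether your terminal is really using the new environment (after you've activated it), a small check like the one below can help. This is only a sketch: the function name in_virtualenv is my own, and it relies on the fact that Python points sys.prefix at the environment directory while sys.base_prefix keeps pointing at the system-wide installation.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; newer virtualenv and venv
    # leave sys.base_prefix pointing at the system-wide Python instead.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Inside a virtual environment:", in_virtualenv())
print("Packages will be installed under:", sys.prefix)
```

If the first line prints False, the environment is most likely not activated in the current terminal.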

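3. Now let's install the necessary packages.

The main tool we will use to read our PDF is the Tesseract OCR engine. I am also installing tesseract-ocr-hin because I want to read Hindi content from a PDF.

sudo apt-get install tesseract-ocr tesseract-ocr-hin

pdf2image relies on the Poppler utilities to render PDF pages, so install them as well if they are not already present.

sudo apt-get install poppler-utils

Next, install the Python packages: pdf2image to convert PDF pages to images, pytesseract to call Tesseract from Python, and Pillow to manage the images.

pip3 install pdf2image pytesseract Pillow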
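
Before moving on, it can save some debugging time to confirm that everything is actually installed. The sketch below is one possible way to do that, using only the standard library. The helper names missing_modules and missing_languages are hypothetical ones I made up for this example, not part of pytesseract or pdf2image.

```python
import importlib.util
import shutil


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def missing_languages(lang, installed):
    """Return the codes of a Tesseract lang string (e.g. 'hin+eng') not in `installed`."""
    requested = [code for code in lang.split("+") if code]
    return [code for code in requested if code not in installed]


# Import names can differ from pip names: Pillow is imported as PIL.
print("Missing Python modules:", missing_modules(["pdf2image", "pytesseract", "PIL"]))

# shutil.which returns None when the tesseract binary is not on the PATH.
print("tesseract binary found at:", shutil.which("tesseract"))
```

Once pytesseract imports cleanly, recent versions provide pytesseract.get_languages(), so something like missing_languages('hin+eng', pytesseract.get_languages()) should tell you whether a language pack is still missing.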

4. If you have a PDF, use this code to first convert its pages into images.


# Convert each PDF page to a PNG image using pdf2image

from pdf2image import convert_from_path

pages = convert_from_path('your_pdf.pdf', 300)  # 300 DPI for better OCR accuracy
for i, page in enumerate(pages):
    page.save(f'page_{i}.png', 'PNG')  # page_0.png, page_1.png, ...


5. Now we just loop over the page images, read each one, and store the text in a file. I want to recognize Hindi and English together, so I pass lang='hin+eng' to the pytesseract.image_to_string function. You can check the other available language packs on the Tesseract website and install them with the apt package manager, or on Windows by installing them the usual way, as you would any other application.


# Convert the page images to text

import pytesseract
from PIL import Image

num_pages = 2  # Set this to the number of page images saved in step 4

mixed_text = ""
for i in range(num_pages):
    img = Image.open(f'page_{i}.png')

    # Specify both languages: 'hin' (Hindi) and 'eng' (English)
    text = pytesseract.image_to_string(img, lang='hin+eng')
    mixed_text += text + "\n"

# Save the converted text to a file
with open('mixed_text_output.txt', 'w', encoding='utf-8') as f:
    f.write(mixed_text)
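
If you don't need the intermediate PNG files, the images returned by convert_from_path can be passed to pytesseract directly, since image_to_string accepts PIL image objects. Here is a rough sketch of that idea. The function name ocr_images, the page separator lines, and the optional ocr argument (which lets you plug in a stand-in function for testing) are my own additions, not something pytesseract provides.

```python
def ocr_images(images, lang="hin+eng", ocr=None):
    """Run OCR on each image and return the page texts joined by blank lines."""
    if ocr is None:
        # Imported lazily so the function can be tried with a stand-in OCR callable.
        import pytesseract
        ocr = pytesseract.image_to_string

    texts = []
    for number, image in enumerate(images, start=1):
        text = ocr(image, lang=lang).strip()
        # Mark where each page starts so the output file is easier to navigate.
        texts.append(f"--- page {number} ---\n{text}")
    return "\n\n".join(texts)
```

With the packages from step 3 installed, ocr_images(convert_from_path('your_pdf.pdf', 300)) should return the text of the whole PDF as one string, which you can then write to a file as shown above.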
    

That's all! If you need any help or have any questions, just leave them in the comment section. I will be happy to help.

