Enjoying Open Source Software

Use Tesseract to OCR a multiple page PDF

Tesseract to OCR

Tesseract is a well-known open source OCR engine with a reputation for accurate recognition.

Ghostscript to convert PDF to TIFF

Tesseract works on image files, so the PDF must first be converted to an image. Ghostscript can render all pages of the PDF into a single multi-page TIFF file at high resolution.

Converting the PDF to TIFF:

gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 \
 -sOutputFile=output_filename.tiff input_filename.pdf

Convert TIFF to text with Tesseract

tesseract output_filename.tiff text_file -l eng

Here text_file is the base name of the output file. Tesseract appends a ".txt" extension to it, so the recognized text ends up in text_file.txt.
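
The two steps can be combined into one small shell function. This is only a sketch: the function name ocr_pdf, the choice to name the output after the input PDF, and the default language are assumptions for illustration and are not part of either tool. The gs and tesseract options are the same ones used in the commands above.

```sh
# Sketch: OCR a multi-page PDF in two steps (Ghostscript, then Tesseract).
# ocr_pdf is a hypothetical helper name; eng as default language is an assumption.
ocr_pdf() {
    pdf=$1
    lang=${2:-eng}
    base=${pdf%.pdf}    # strip the .pdf suffix: scan.pdf -> scan

    # Step 1: render all pages into one TIFF file with the tiffg4 device.
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
       -sOutputFile="$base.tiff" "$pdf" || return 1

    # Step 2: OCR the TIFF; Tesseract adds .txt to the output base name.
    tesseract "$base.tiff" "$base" -l "$lang" || return 1

    # Print the name of the text file that was written.
    echo "$base.txt"
}
```

Called as ocr_pdf input_filename.pdf, this should leave input_filename.tiff and input_filename.txt next to the PDF. A second argument selects another language, provided the matching Tesseract language data is installed.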

Tags: tesseract