Posted: 2018-05-30

Use Tesseract to OCR a multiple page PDF

Tesseract is a well-known open source OCR engine with a reputation for accurate text recognition.

First we must convert our PDF to a TIFF file.

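Tesseract requires image files, so first we have to convert the PDF to images. Ghostscript gives us high-quality images for this: the command below renders every page at 600 dpi in black and white (the tiffg4 device) and writes all pages into a single multi-page TIFF file.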

gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 \
 -sOutputFile=output_filename.tiff input_filename.pdf

tesseract output_filename.tiff text_file -l eng

Here text_file is the base name of the output file. Tesseract appends a ".txt" extension to it, so the recognized text ends up in text_file.txt. The -l eng option selects the English language data.
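
The two commands can be wrapped in a small shell function so that the file names are derived from the PDF name. This is only a sketch: the function name ocr_pdf and its optional language argument are hypothetical and not part of Ghostscript or Tesseract, and it assumes both gs and tesseract are on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ghostscript or Tesseract).
# Usage: ocr_pdf input.pdf [language]
ocr_pdf() {
    pdf=$1
    lang=${2:-eng}          # default to English, as in the command above
    base=${pdf%.pdf}        # strip a trailing .pdf to get the base name

    # Step 1: render all pages into one multi-page TIFF (same flags as above).
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
        -sOutputFile="$base.tiff" "$pdf" || return 1

    # Step 2: OCR the TIFF. Tesseract appends .txt to the output base itself.
    tesseract "$base.tiff" "$base" -l "$lang" || return 1

    # Print the name of the resulting text file.
    echo "$base.txt"
}
```

After sourcing the file, calling ocr_pdf input_filename.pdf should leave the text in input_filename.txt, next to the intermediate input_filename.tiff. The TIFF is kept on purpose so it can be inspected if the OCR result looks wrong.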
