Use Tesseract to OCR a multi-page PDF
Tesseract to OCR
Tesseract is a well-known open-source OCR engine, valued for the quality of its text recognition.
Ghostscript to convert PDF to tiff
Tesseract requires image files, so first we have to convert the PDF to a tiff file. When we use Ghostscript for this, we get high-quality images.
Converting the PDF to tiff:
gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 \
-sOutputFile=output_filename.tiff input_filename.pdf
Convert tiff to text with Tesseract
tesseract output_filename.tiff text_file -l eng
Here text_file is the base name of the output file. Tesseract appends a ".txt" extension, so the recognized text ends up in text_file.txt.
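The two steps can be wrapped in a small shell function so that one call handles a whole PDF. This is only a sketch: the function name ocr_pdf and the choice to name the outputs after the input file are my own assumptions, and it expects gs and tesseract to be on the PATH.

```shell
# ocr_pdf is a hypothetical helper name, not part of Tesseract or Ghostscript.
# It derives the output names from the input name (an assumption of this sketch).
ocr_pdf() {
    pdf=$1
    base=${pdf%.pdf}    # strip the .pdf extension

    # Step 1: render the PDF into a single tiff (same flags as above).
    gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 \
       -sOutputFile="$base.tiff" "$pdf" || return 1

    # Step 2: OCR the tiff; Tesseract appends ".txt" to the base name.
    tesseract "$base.tiff" "$base" -l eng
}
```

Calling ocr_pdf scan.pdf should then leave scan.tiff and scan.txt next to the original. The function stops after the first step if Ghostscript fails, so Tesseract is never run on a missing or partial tiff.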