tesseract OCR and page layout

Gary Stainburn gary.stainburn at ringways.co.uk
Tue Sep 8 11:04:07 UTC 2015


Hi folks.

When I use pdftotext from poppler-utils I use the -layout argument to get the 
resulting text file to match the page layout as closely as possible to the 
PDF file.

This means that lines such as

line1col1   line1col2         line1col3
line2col1   line2col2         line2col3

are output as such.  However, when I use tesseract to extract text from PDF 
files that don't have embedded text, I can't seem to get the same effect. Am I 
missing something with tesseract, or is there an alternative OCR tool that can 
give me what I want?
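
For reference, the pdftotext call described above could be scripted roughly as 
follows. This is only a sketch: the helper names and the file names are made 
up for illustration, and it assumes pdftotext from poppler-utils is on the 
PATH.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path):
    """Return the argument list for a layout-preserving pdftotext run.

    Hypothetical helper, named here only for illustration.
    """
    # -layout asks pdftotext to maintain the original physical layout,
    # so columns stay lined up with padding spaces in the text output.
    return ["pdftotext", "-layout", pdf_path, txt_path]


def extract_with_layout(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the tool exits non-zero."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)


# Example call (placeholder file names, not from the original post):
# extract_with_layout("input.pdf", "output.txt")
```

Keeping the command construction separate from the call that runs it makes 
it easy to check the arguments without having the tool installed.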

