tesseract OCR and page layout
Gary Stainburn
gary.stainburn at ringways.co.uk
Tue Sep 8 11:04:07 UTC 2015
Hi folks.
When I use pdftotext from poppler-utils I use the -layout argument to get the
resulting text file to match the page layout as closely as possible to the
PDF file.
This means that lines such as
line1col1 line1col2 line1col3
line2col1 line2col2 line2col3
are output as such. However, when I use tesseract to extract text from PDF
files that don't have embedded text I can't seem to get the same effect. Am I
missing something with tesseract, or is there an alternative OCR tool that can give
me what I want?
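To make the desired effect concrete, here is a minimal Python sketch of layout-preserving output. It assumes the OCR step can report a pixel position for each recognised word (tesseract's hOCR and TSV outputs include word bounding boxes). The function name render_layout, the input tuple format, and the char_width and line_height defaults are hypothetical and chosen only for illustration; this is not how pdftotext or tesseract work internally.

```python
from collections import defaultdict


def render_layout(words, char_width=10, line_height=20):
    """Place words on a character grid based on their pixel positions.

    words: iterable of (text, left, top) tuples, positions in pixels.
    char_width / line_height: assumed pixel size of one character cell.
    Returns a string with one line per occupied text row.
    """
    # Group words into rows by snapping the top coordinate to a row index.
    rows = defaultdict(list)
    for text, left, top in words:
        rows[round(top / line_height)].append((left, text))

    lines = []
    for row in sorted(rows):
        line = ""
        for left, text in sorted(rows[row]):
            col = int(left // char_width)
            if line:
                # Pad to the target column, keeping at least one space
                # between neighbouring words.
                line = line.ljust(max(col, len(line) + 1))
            else:
                line = " " * col
            line += text
        lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        ("line1col1", 0, 0), ("line1col2", 200, 2), ("line1col3", 400, 0),
        ("line2col1", 0, 20), ("line2col2", 200, 21), ("line2col3", 400, 20),
    ]
    print(render_layout(sample))
```

With the sample data, each column starts at the same character offset on both lines, which is the behaviour -layout gives for PDFs with embedded text. Real scans would need the cell sizes estimated from the page rather than fixed.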