tesseract OCR and page layout

Gary Stainburn <gary.stainburn@xxxxxxxxxxxxxx> · Tue, 8 Sep 2015 12:04:07 +0100

Hi folks.

When I use pdftotext from poppler-utils, I pass the -layout argument so that 
the resulting text file matches the page layout of the PDF file as closely as 
possible.
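
For reference, here is a minimal Python sketch of the pdftotext call I mean. 
The helper names and the file names are placeholders of mine, not part of 
poppler-utils; the only real pieces are the pdftotext command and its -layout 
option. It assumes pdftotext is installed and on the PATH.

```python
import subprocess


def pdftotext_layout_cmd(pdf_path, txt_path):
    """Build the pdftotext command line with -layout enabled."""
    # -layout asks pdftotext to keep the physical page layout in the output
    return ["pdftotext", "-layout", pdf_path, txt_path]


def extract_with_layout(pdf_path, txt_path):
    """Run pdftotext; raises CalledProcessError if the command fails."""
    subprocess.run(pdftotext_layout_cmd(pdf_path, txt_path), check=True)
```

I would call it as extract_with_layout("input.pdf", "output.txt"), which only 
works for PDFs that already have embedded text.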

This means that lines such as

line1col1   line1col2         line1col3
line2col1   line2col2         line2col3

are output as such. However, when I use tesseract to extract text from PDF 
files that don't have embedded text, I can't seem to get the same effect. Am I 
missing something with tesseract, or is there an alternative OCR tool that can 
give me what I want?