Anyone able to OCR a PDF file?

jason@xxxxxxxxxxxx (Jason White) · Thu, 5 Jan 2012 09:38:45 +0000 (UTC)

 <speakup at braille.uwo.ca> wrote:
>Willem van der Walt wrote:

>> I find that the best of the open-source engines is cuneiform.
>
>Aha, interesting.  I've always used tesseract.  cuneiform is
>in debian wheezy (testing) but not yet in debian stable... 

Cuneiform is now officially unmaintained upstream. If you like it and you know
someone familiar with OCR algorithms who has time to spare, or someone who
might know such a person, it's time to establish the right connections.

I occasionally monitor the lists for Cuneiform and OCRopus.
>
>Depending on how the PDF was produced, it's possible that
>  ps2txt filename.pdf
>(a.k.a. ps2ascii) might help; I think it comes with ghostscript.

pdftotext and pdftohtml (as well as similar tools) will work, but only if
there is text in the PDF files.  If the pages contain only images of text
rather than characters, you have to apply OCR. The size of the PDF file
usually gives a strong indication of whether it contains rasterized images,
and of course you can use the tools in poppler-utils (pdffonts and pdfimages,
for example) to find out.
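
As a rough sketch of that check (an illustration, not a tested recipe): run
pdftotext and see whether anything other than whitespace comes out. The
function names has_text and needs_ocr are made up for this example, and the
20-character threshold is an arbitrary assumption. It assumes pdftotext from
poppler-utils is installed.

```shell
#!/bin/sh
# Sketch: guess whether a PDF needs OCR by looking for extractable text.
# has_text and needs_ocr are hypothetical helper names; the default
# threshold of 20 characters is an arbitrary assumption.

# has_text [MIN]: succeed if standard input contains more than MIN
# non-whitespace characters (form feeds between pages count as whitespace).
has_text() {
    min=${1:-20}
    count=$(( $(tr -d '[:space:]' | wc -c) ))
    [ "$count" -gt "$min" ]
}

# needs_ocr FILE: succeed if pdftotext extracts (almost) no text from FILE.
# Note: this also succeeds if pdftotext is missing or cannot read FILE.
needs_ocr() {
    ! pdftotext "$1" - 2>/dev/null | has_text
}
```

With those functions loaded, and scan.pdf standing in for your own file:

  needs_ocr scan.pdf && echo "no text layer, apply OCR"

A file that fails this check is the case where an OCR engine such as
tesseract or cuneiform has to be used instead.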