Anyone able to OCR a PDF file?

Peter Billam (pj at pjb.com.au) · Wed, 4 Jan 2012 20:24:26 +1000

Willem van der Walt wrote:
> The different OCR engines require different image formats.
> Some of them are really dumb.

They probably derive from old code written without a
format-independent graphics library.

> I find that the best of the open-source engines is cuneiform.

Aha, interesting.  I've always used tesseract.  cuneiform is
in Debian wheezy (testing) but not yet in Debian stable.

Depending on how the PDF was produced (i.e. if it holds real text
rather than scanned page images), it's possible that
  ps2txt filename.pdf
(a.k.a. ps2ascii) might help; I think it comes with ghostscript.
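
A rough sketch of how one might check whether plain text extraction is
enough before reaching for an OCR engine.  It assumes ghostscript's
ps2ascii is on the PATH; the function names and the MIN_CHARS threshold
are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Sketch only: decide whether a PDF already has a usable text layer.
# MIN_CHARS is an arbitrary, hypothetical threshold; tune it to taste.
MIN_CHARS=${MIN_CHARS:-100}

# Succeeds if the string in $1 holds at least MIN_CHARS
# non-whitespace characters.
has_enough_text() {
    count=$(printf '%s' "$1" | tr -d '[:space:]' | wc -c | tr -d ' ')
    [ "$count" -ge "$MIN_CHARS" ]
}

# Prints the text that ps2ascii extracts from the PDF in $1, or
# returns non-zero if extraction fails or yields too little text.
extract_text_layer() {
    text=$(ps2ascii "$1" 2>/dev/null) || return 1
    has_enough_text "$text" || return 1
    printf '%s\n' "$text"
}
```

Something like
  extract_text_layer filename.pdf > filename.txt || echo "needs OCR"
would then tell you whether the pages are probably scanned images,
in which case they have to be rasterised and fed to an OCR engine.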

Regards,  Peter Billam

http://www.pjb.com.au       pj at pjb.com.au      (03) 6278 9410
"Was der Meister nicht kann,   vermöcht es der Knabe, hätt er
 ihm immer gehorcht?"   Siegfried to Mime, from Act 1 Scene 2