<speakup at braille.uwo.ca> wrote:
>Willem van der Walt wrote:
>> I find that the best of the open-source engines is cuneiform.
>
>Aha, interesting. I've always used tesseract. cuneiform is
>in debian wheezy (testing) but not yet in debian stable...

It is now officially unmaintained upstream. If you like it and you know someone familiar with OCR algorithms who has time to spare, or someone who might know such a person, it's time to establish the right connections. I occasionally monitor the lists for Cuneiform and OCR Opus.

>Depending on how the PDF was produced, it's possible that
>  ps2txt filename.pdf
>(a.k.a. ps2ascii) might help; I think it comes with ghostscript.

Pdftotext and Pdftohtml (as well as similar tools) will work, but only if there is text in the PDF files. If there are only images of text rather than characters, you have to apply OCR. The size of the PDF file usually gives a strong indication of whether it contains rasterized images or not, and of course you can use the tools in poppler-utils to find out.
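
For what it's worth, here is a rough Python sketch of the "find out" step: run pdftotext from poppler-utils and see whether anything meaningful comes back. The function names needs_ocr and extract_text and the 50-character threshold are my own assumptions for illustration, not part of any tool, and the threshold would need tuning for real documents.

```python
import shutil
import subprocess


def needs_ocr(extracted_text: str, min_chars: int = 50) -> bool:
    """Guess whether a PDF is image-only from the text pdftotext returned.

    min_chars is an arbitrary, hypothetical cutoff: fewer visible
    (non-whitespace) characters than this is treated as "no text layer".
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars


def extract_text(pdf_path: str) -> str:
    """Run pdftotext (poppler-utils) and return what it writes to stdout."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    # A "-" as the output file name asks pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

With poppler-utils installed, needs_ocr(extract_text("filename.pdf")) returning True suggests the file holds only page images and should go through an OCR engine such as tesseract or cuneiform; False suggests the pdftotext output is usable as it stands. A scanned PDF that already carries a poor hidden text layer would slip past a check this simple.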