new OCR project for Linux!

jason@xxxxxxxxxxxx (Jason White) · Sat, 4 Apr 2009 04:55:26 +0000 (UTC)

Marcel Oats  <speakup at braille.uwo.ca> wrote:
>I'd like a PDF converter for Linux.  Any ideas?

I would like to discuss the aspect of this issue which is on topic for this
thread.

Some PDF documents contain only scanned images of the printed pages; there is
no character-encoded text in such files.

Given an OCR system, it should be possible to convert such files to text by
extracting the page images using pdfimages (part of Xpdf), performing any
conversions that may be necessary, then processing the image files with OCR.
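
As a rough illustration of that pipeline, here is a minimal Python sketch. It
rests on several assumptions that are mine rather than anything established in
this thread: Tesseract stands in for the OCR system, it is assumed to read the
PBM/PGM/PPM files that pdfimages writes by default (so the conversion step is
omitted), and the helper names extract_cmd, ocr_cmd and pdf_to_text are
hypothetical.

```python
import subprocess
from pathlib import Path


def extract_cmd(pdf, image_root):
    # pdfimages takes the PDF file and a root name for the image files it writes.
    return ["pdfimages", str(pdf), str(image_root)]


def ocr_cmd(image, output_base):
    # Assumption: Tesseract is the OCR engine; it writes output_base.txt.
    return ["tesseract", str(image), str(output_base)]


def pdf_to_text(pdf, workdir):
    """Extract the page images from pdf, OCR each one, and return the joined text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(extract_cmd(pdf, workdir / "page"), check=True)

    pages = []
    # Zero-padded image numbers keep a plain sort in page order.
    for image in sorted(workdir.glob("page-*.p[bgp]m")):
        base = image.with_suffix("")
        subprocess.run(ocr_cmd(image, base), check=True)
        pages.append(Path(f"{base}.txt").read_text(encoding="utf-8"))
    return "\n".join(pages)
```

Calling pdf_to_text("scan.pdf", "ocr-work") would then return the recognised
text as one string. If the OCR engine cannot read the extracted image format
directly, a conversion command would need to run between the two steps.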

The quality of the output depends, of course, on the accuracy of the OCR
system and the characteristics of the page images.