Re: command line scanned pdf to text

John G Heim <jheim@xxxxxxxxxxxxx> · Mon, 2 Nov 2015 14:13:04 -0600

I've been scanning in the D&D 5th Edition player's handbook. I tried 
every open source OCR program I could find and tesseract was easily the 
best. On pages that are just prose, it probably does about 99% accuracy. 
Even on pages that are 2 columns of prose, it does really well if 
you tell it to look for that. Somebody sent me a pdf of the same book 
done with a professional OCR program for Windows. The results are 
approximately equal. Tesseract may lack the bells & whistles of 
commercial products but for accuracy, it's pretty good.
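
For anyone who wants to script the pipeline discussed in this thread
(pdftocairo to render the pages, then tesseract on each image), here is
a rough sketch rather than a tested recipe. It assumes pdftocairo and a
tesseract 4 or later are on the PATH. The function names, the "pages"
directory, the 300 dpi setting and the use of PNG instead of JPEG are my
own choices for illustration. The --psm option selects tesseract's page
segmentation mode; I believe older 3.x releases spelled it -psm, and
which mode works best on 2-column pages is something to try out, not
something this sketch settles.

```python
import subprocess
from pathlib import Path


def render_cmd(pdf, prefix, dpi=300):
    """Build the pdftocairo command that renders each page to <prefix>-<n>.png."""
    return ["pdftocairo", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def ocr_cmd(image, psm=3, lang="eng"):
    """Build the tesseract command for one image; output goes to <image base>.txt."""
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")),
            "-l", lang, "--psm", str(psm)]


def page_number(image):
    """Extract the page number from a name like page-12.png."""
    return int(Path(image).stem.rsplit("-", 1)[1])


def pdf_to_text(pdf, workdir="pages", psm=3):
    """Render every page of the PDF, OCR each one, and return the joined text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(render_cmd(pdf, workdir / "page"), check=True)
    chunks = []
    # Sort numerically so page 10 does not come before page 2.
    for image in sorted(workdir.glob("page-*.png"), key=page_number):
        subprocess.run(ocr_cmd(image, psm), check=True)
        chunks.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(chunks)


# Example use (hypothetical file name):
#   text = pdf_to_text("handbook.pdf", psm=3)
```

Building the commands in separate functions makes it easy to print them
and check them by hand before running anything on a whole book.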

On 11/01/2015 11:24 PM, Tom Fowle wrote:
Am I the last to find this?
  command line ocr tesseract
won't directly support .pdf but
pdftocairo
produces .jpg among others which tesseract will read.

May not do well with columns but not too bad.

Is there anything better?

Thanks
Tom Fowle
_______________________________________________
Speakup mailing list
Speakup@xxxxxxxxxxxxxxxxx
http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup

--
John Heim, jheim@xxxxxxxxxxxxx, 608-263-4189, skype:john.g.heim, 
sip:jheim@xxxxxxxxxxxxxxxx