Re: command line scanned pdf to text

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




I've been scanning in the D&D 5th Edition player's handbook. I tried every open source OCR program I could find and tesseract was easily the best. On pages that are just prose, it probably does about 99% accuracy. Even on pages where that are 2 columns of prose, it does really well if you tell it to look for that. Somebody sent me a pdf of the same book done with a professional OCR program for Windows. The results are approximately equal. Tesseract may lack the bells & whistles of commercial products but for accuracy, it's pretty good.



On 11/01/2015 11:24 PM, Tom Fowle wrote:
Am I the last to find this?
  command line ocr tesseract
won't directly support .pdf but
pdftocairo
produces .jpg among others which tesseract will read.

May not do well with collumns but not too bad.

Is there anything better?

Thanks
tom Fowle
_______________________________________________
Speakup mailing list
Speakup@xxxxxxxxxxxxxxxxx
http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup


--
John Heim, jheim@xxxxxxxxxxxxx, 608-263-4189, skype:john.g.heim, sip:jheim@xxxxxxxxxxxxxxxx
_______________________________________________
Speakup mailing list
Speakup@xxxxxxxxxxxxxxxxx
http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup




[Index of Archives]     [Linux for the Blind]     [Fedora Discussioin]     [Linux Kernel]     [Yosemite News]     [Big List of Linux Books]
  Powered by Linux