Re: command line scanned pdf to text

Willem van der Walt <wvdwalt@xxxxxxxxxx> · Tue, 3 Nov 2015 07:14:13 +0200 (SAST)

Cuneiform is, IMHO, a better OCR engine than tesseract.
It is available as a package on Ubuntu.
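
For anyone who wants to compare it with tesseract, here is a minimal sketch. It assumes the package is simply called cuneiform and that the command takes -l, -f and -o the way the cuneiform-linux builds I have seen do; the function name and the file names are made up for illustration.

```shell
# Hypothetical helper: OCR one image with cuneiform into a plain-text file.
# Assumes cuneiform is already installed, e.g. with:
#   sudo apt-get install cuneiform
ocr_with_cuneiform() {
    image="$1"                      # input image, e.g. page.png
    out="${2:-${image%.*}.txt}"     # default output: same base name, .txt
    # -l picks the language, -f the output format, -o the output file
    cuneiform -l eng -f text -o "$out" "$image"
}
```

Calling `ocr_with_cuneiform page.png` should then leave the recognized text in page.txt, if the flags behave as assumed.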
Regards, Willem

On Mon, 2 Nov 2015, Cheryl Homiak wrote:

I am sure TIFF is supported. It is really strange: I get what look like words, and I get the same output every time I scan the same image, but the words are nonsense. I even tried adding the designation for English, thinking that somehow it wasn't using English, but got the same results. I know the image file is okay because it comes out fine using ABBYY FineReader Express on my Mac.
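
For reference, one way to pass the language explicitly and see which language data tesseract can find is sketched below. This is not a diagnosis of the gibberish, only a record of the kind of command involved; the function name and file names are hypothetical, and it assumes a tesseract build that has the --list-langs option.

```shell
# Hypothetical helper: run tesseract on one image with an explicit language.
ocr_check() {
    image="$1"            # e.g. scan.tif
    lang="${2:-eng}"      # language code; eng is assumed to be installed
    # Print the language data files this tesseract installation knows about.
    tesseract --list-langs
    # OCR with the language named explicitly; the text goes to <base>.txt
    tesseract "$image" "${image%.*}" -l "$lang"
}
```

With this, `ocr_check scan.tif` would write scan.txt using the English data.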

--
Cheryl

May the words of my mouth
and the meditation of my heart
be acceptable to You, Lord,
my rock and my Redeemer.
(Psalm 19:14 HCSB)

On Nov 2, 2015, at 10:15 PM, Tom Fowle <wa6ivgtf@xxxxxxxxxxx> wrote:

Cheryl,
I arbitrarily chose to convert the PDF to JPEG, as tesseract doesn't read
PDF.

Then I just ran
tesseract filename.jpg outfile
which produces
outfile.txt
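
For a multi-page scan, the same command can be repeated over every page image and the results joined. The sketch below assumes the pages are already JPEG files in one directory; the function name and the directory layout are invented for illustration.

```shell
# Hypothetical helper: OCR every .jpg in a directory and join the text.
ocr_pages() {
    dir="$1"          # directory holding the page images
    combined="$2"     # file that receives the text of all pages
    : > "$combined"   # start with an empty output file
    for img in "$dir"/*.jpg; do
        [ -e "$img" ] || continue     # skip the literal pattern if no match
        base="${img%.jpg}"
        tesseract "$img" "$base"      # tesseract writes "$base.txt"
        cat "$base.txt" >> "$combined"
    done
}
```

The shell expands *.jpg in lexical order, so page numbers should be zero-padded (page-01, page-02, and so on) once there are ten or more pages, or the text will come out in the wrong order.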

Sorry, I haven't tried .tif, and I couldn't find a list of supported file types.

tom fowle

On Mon, Nov 02, 2015 at 02:53:45PM -0600, Cheryl Homiak wrote:
Would you mind expanding on this if you can and have time? What kind of file did you use, and what did you put on your command line? I am asking because I have tried tesseract a couple of times with TIFF files and have gotten mostly gibberish, so obviously I am doing something wrong. I am running Debian testing, if that makes a difference.

Thanks.

--
Cheryl


On Nov 2, 2015, at 2:13 PM, John G Heim <jheim@xxxxxxxxxxxxx> wrote:

I've been scanning in the D&D 5th Edition Player's Handbook. I tried every open source OCR program I could find, and tesseract was easily the best. On pages that are just prose, it probably gets about 99% accuracy. Even on pages that have 2 columns of prose, it does really well if you tell it to look for that. Somebody sent me a PDF of the same book done with a professional OCR program for Windows, and the results are approximately equal. Tesseract may lack the bells & whistles of commercial products, but for accuracy it's pretty good.
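
The message does not say which option was used for the two-column pages. One likely candidate is tesseract's page segmentation mode, so the sketch below shows how that is passed. It is an assumption that this is the setting meant above, and the function name and file names are hypothetical.

```shell
# Hypothetical helper: OCR one image with an explicit page segmentation mode.
ocr_with_psm() {
    image="$1"        # e.g. page.jpg
    mode="${2:-3}"    # 3 = fully automatic page segmentation (the default)
    # Newer tesseract releases spell the flag --psm; 3.x releases used -psm.
    tesseract "$image" "${image%.*}" --psm "$mode"
}
```

Mode 1 adds orientation and script detection to the automatic layout analysis, while mode 6 treats the page as a single uniform block of text, so trying `ocr_with_psm page.jpg 1` against `ocr_with_psm page.jpg 6` on a two-column page is a quick way to see how much the setting matters.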

On 11/01/2015 11:24 PM, Tom Fowle wrote:
Am I the last to find this?
The command-line OCR program tesseract
won't directly support .pdf, but
pdftocairo
produces .jpg, among other formats, which tesseract will read.
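
A minimal sketch of the conversion step follows. It uses pdftocairo's -jpeg and -r options; the 300 dpi default, the function name and the file names are assumptions for illustration rather than anything from this thread.

```shell
# Hypothetical helper: render each page of a PDF to a JPEG for OCR.
pdf_to_jpegs() {
    pdf="$1"                # the scanned PDF
    prefix="${2:-page}"     # output root; files come out as <prefix>-<n>.jpg
    dpi="${3:-300}"         # assumption: 300 dpi is a common choice for OCR
    pdftocairo -jpeg -r "$dpi" "$pdf" "$prefix"
}
```

After `pdf_to_jpegs book.pdf`, each page should exist as its own page-<n>.jpg file, ready to be handed to tesseract one at a time.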

May not do well with columns, but not too bad.

Is there anything better?

Thanks
tom Fowle
_______________________________________________
Speakup mailing list
Speakup@xxxxxxxxxxxxxxxxx
http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup

--
John Heim, jheim@xxxxxxxxxxxxx, 608-263-4189, skype:john.g.heim, sip:jheim@xxxxxxxxxxxxxxxx
