Re: command line scanned pdf to text

Jude DaShiell <jdashiel@xxxxxxxxx> · Wed, 4 Nov 2015 11:21:48 -0500 (EST)

Thanks, I'll search Arch Linux and see if that shows up.

On Wed, 4 Nov 2015, John G Heim wrote:

Date: Wed, 4 Nov 2015 10:11:22
From: John G Heim <jheim@xxxxxxxxxxxxx>
Reply-To: Speakup is a screen review system for Linux.
    <speakup@xxxxxxxxxxxxxxxxx>
To: Speakup is a screen review system for Linux. <speakup@xxxxxxxxxxxxxxxxx>
Subject: Re: command line scanned pdf to text

On Ubuntu it's tesseract-ocr-eng.
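
Whatever the package is called on a given distribution, tesseract itself
refers to English by the language code "eng". A minimal sketch for
checking whether that data is already installed follows. The helper name
has_tesseract_lang is hypothetical, and it assumes a tesseract new
enough to support --list-langs.

```shell
# Sketch: report whether tesseract has data for a given language code.
# has_tesseract_lang is a hypothetical helper, not part of tesseract.
has_tesseract_lang() {
    # --list-langs prints a header line, then one language code per line.
    # Some versions write the list to stderr, so merge both streams.
    tesseract --list-langs 2>&1 | grep -qx "$1"
}

# Example use:
#   has_tesseract_lang eng || echo "English data not installed"
```

If the check fails, installing only the English data package for your
distribution should be enough; the full data set is not needed.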

On 11/04/2015 09:01 AM, Jude DaShiell wrote:
Which data pack for tesseract has the English language in it?  I'm being
prompted to download a data pack, and I figure it's best to get only the
language I understand rather than the whole data set, since both memory
and disk space over here are limited.

On Mon, 2 Nov 2015, Cheryl Homiak wrote:

Date: Mon, 2 Nov 2015 17:39:38
From: Cheryl Homiak <cah4110@xxxxxxxxxx>
Reply-To: Speakup is a screen review system for Linux.
    <speakup@xxxxxxxxxxxxxxxxx>
To: Speakup is a screen review system for Linux.
<speakup@xxxxxxxxxxxxxxxxx>
Subject: Re: command line scanned pdf to text

Thanks much. No, the way to get into a turned-off computer far away
hasn't been invented yet, unless you can turn it on by remote control
somehow - :-)
I suspect the error was mine so I won't give up on it yet.

Thanks.

--
Cheryl

May the words of my mouth
and the meditation of my heart
be acceptable to You, Lord,
my rock and my Redeemer.
(Psalm 19:14 HCSB)

On Nov 2, 2015, at 4:06 PM, John G Heim <jheim@xxxxxxxxxxxxx> wrote:

Huh, it strikes me as strange that tesseract didn't work for you. I
used tesseract last week to read a page in a pdf document that was
stored as an image. I used pdftohtml to extract the image and then
tesseract to convert it to text. I also pretty routinely use
tesseract to read screen capture images. It's not very accurate there
but it's usually good enough to make sense of.
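
The extract-then-recognize approach described here could be sketched
roughly as follows. This is not the exact command line used; it swaps in
pdftoppm (from poppler-utils) for the pdftohtml step, renders every page
rather than a single one, and the helper name ocr_pdf and the 300 dpi
setting are assumptions made for illustration.

```shell
# Sketch: OCR every page of an image-only PDF and print the text.
# ocr_pdf is a hypothetical helper; pdftoppm stands in for pdftohtml.
ocr_pdf() {
    pdf=$1
    workdir=$(mktemp -d) || return 1
    # Render each page to a PNG at 300 dpi: page-1.png, page-2.png, ...
    pdftoppm -r 300 -png "$pdf" "$workdir/page" || { rm -rf "$workdir"; return 1; }
    # pdftoppm zero-pads page numbers, so glob order matches page order.
    for img in "$workdir"/page-*.png; do
        [ -e "$img" ] || continue
        # "stdout" as the output name makes tesseract print the text.
        tesseract "$img" stdout
    done
    rm -rf "$workdir"
}

# Example use:
#   ocr_pdf scanned.pdf > scanned.txt
```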

Just "tesseract <infile> <outfile>" should work. The infile can be
the string "stdin", in which case it reads from standard input. The
outfile can be "stdout", in which case it writes the text to stdout.
Offhand, I do not have the command line I use to scan the D&D
book. It's on a computer at home that is turned off at the moment.
But I can post the whole thing tonight. Here are some lines from a
backup version of the script:

scanimage --format=tiff --mode Lineart --resolution 600 > /tmp/page.tiff
tesseract /tmp/page.tiff stdout
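
Since the infile can be "stdin", the two lines above could probably be
joined into one pipeline with no temporary file. A sketch, assuming the
installed tesseract accepts "stdin" as described; scan_to_text is a
hypothetical helper name, and the scanimage options are the ones from
the script lines above.

```shell
# Sketch: scan one page and OCR it without writing /tmp/page.tiff.
# scan_to_text is a hypothetical helper name.
scan_to_text() {
    # scanimage writes the TIFF to standard output; tesseract reads it
    # from standard input and prints the recognized text.
    scanimage --format=tiff --mode Lineart --resolution 600 |
        tesseract stdin stdout
}

# Example use:
#   scan_to_text > page.txt
```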

On 11/02/2015 02:53 PM, Cheryl Homiak wrote:
Would you mind enlarging on this if you can and have time? What kind
of file did you use and what did you put in your command-line? I am
asking this because I have tried to use tesseract a couple of times
with tiff files and have gotten mostly gibberish so obviously I am
doing something wrong. I am running debian testing if that makes a
difference.

Thanks.

--
John Heim, jheim@xxxxxxxxxxxxx, 608-263-4189, skype:john.g.heim,
sip:jheim@xxxxxxxxxxxxxxxx
_______________________________________________
Speakup mailing list
Speakup@xxxxxxxxxxxxxxxxx
http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup
