Re: command line scanned pdf to text

Thanks. I tried another file and it worked in both cuneiform and tesseract, so I think the two files I tried earlier were an anomaly, or it was a rotation issue. I haven't compared the output to see which package did the better job, but it doesn't hurt to have both of them.
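
For anyone who does want to compare the two, here is a rough sketch of one way to do it. It runs both engines on the same image and prints how many words each one produced. The function names compare_ocr and count_words are made up for this example, and a word count is only a crude stand-in for accuracy, so treat it as a starting point rather than a real benchmark.

```shell
#!/bin/bash
# Sketch only: compare_ocr and count_words are hypothetical helper names.
# Assumes tesseract and/or cuneiform are installed; a missing engine reports 0 words.

# Print the number of whitespace-separated words in a file, or 0 if it is missing.
count_words() {
	if [ -f "$1" ]; then
		wc -w < "$1" | tr -d ' '
	else
		echo 0
	fi
}

# Run each available engine on the image given as $1 and report word counts.
compare_ocr() {
	local image="$1"
	local outdir
	outdir=$(mktemp -d) || return 1

	if command -v tesseract >/dev/null 2>&1; then
		# tesseract adds the .txt extension to the output base name itself
		tesseract "$image" "$outdir/tesseract" >/dev/null 2>&1
	fi
	if command -v cuneiform >/dev/null 2>&1; then
		cuneiform -o "$outdir/cuneiform.txt" "$image" >/dev/null 2>&1
	fi

	echo "tesseract: $(count_words "$outdir/tesseract.txt") words"
	echo "cuneiform: $(count_words "$outdir/cuneiform.txt") words"
	echo "Output kept in $outdir"
}
```

After sourcing the file, something like "compare_ocr /tmp/page.tif" would run it. The text files are left in the temporary directory so they can be read through side by side.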


-- 
Cheryl

May the words of my mouth
and the meditation of my heart
be acceptable to You, Lord,
my rock and my Redeemer.
(Psalm 19:14 HCSB)





> On Nov 3, 2015, at 8:30 AM, John G Heim <jheim@xxxxxxxxxxxxx> wrote:
> 
> Here is the complete script. Sorry I forgot to post it last night. I turned the machine on as I left this morning and sshed into it from work. There's some junk in here you may or may not be interested in. You can pass the script 2 parameters. #1 is the page number. It uses this number to make the output text file name. Page 99 would be named p099.txt. If you don't pass it a page number, it looks for files matching the same pattern and takes the first number not yet in use. So if p001.txt through p099.txt already exist, it would create a p100.txt. The second parameter is the tesseract psm (page segmentation mode) flag. The tesseract man page explains these. The default is 3.
> 
> After it's done with the scan and ocr, it concatenates all the pages into one big file. It also beeps when the page is done: once for an odd-numbered page and twice for an even-numbered one. The double beep is to remind me to turn the page. Otherwise I sometimes forget if I've already done both sides.
> 
> 
> #!/bin/bash
> 
> IDX=$1
> if [ ! -z "$IDX" ]; then
> 	TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> else
> 	for IDX in {1..999}; do
> 		TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> 		test ! -f "${TEXT}" && break
> 	done
> fi
> TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> test ! -z "$VERBOSE" && echo "Working on page $IDX, ${TEXT} ... "
> 
> PSM="$2"
> test -z "$PSM" && PSM=3
> 
> RESOLUTION=600
> SCAN=/tmp/page.tif
> scanimage --format=tiff --mode Lineart --resolution $RESOLUTION > $SCAN
> 
> PAGE=/tmp/page
> tesseract -psm "$PSM" $SCAN $PAGE >/dev/null 2>&1
> cat "${PAGE}.txt" >> "$TEXT"
> /usr/bin/beep -r $((2 - IDX % 2))
> test ! -z "$VERBOSE" && file "${TEXT}"
> OUTFILE="/home/john/phb5/PHB5.txt"
> echo "" > "$OUTFILE"
> for IDX in {1..999}; do
> 	TEXTFILE=`printf "/home/john/phb5/p%03d.txt" $IDX`
> 	if [ -f "$TEXTFILE" ]; then
> 		echo "Page $IDX" >>  "$OUTFILE"
> 		cat "$TEXTFILE">> "$OUTFILE"
> 		echo -e "\f" >> "$OUTFILE"
> 	fi
> done
> # EOF
> 
> IDX=$1
> if [ ! -z "$IDX" ]; then
> 	TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> else
> 	for IDX in {1..999}; do
> 		TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> 		test ! -f "${TEXT}" && break
> 	done
> fi
> TEXT=`printf "/home/john/phb5/p%03d.txt" $IDX`
> test ! -z "$VERBOSE" && echo "Working on page $IDX, ${TEXT} ... "
> 
> PSM="$2"
> test -z "$PSM" && PSM=3
> 
> RESOLUTION=600
> SCAN=/tmp/page.tif
> scanimage --format=tiff --mode Lineart --resolution $RESOLUTION > $SCAN
> 
> PAGE=/tmp/page
> tesseract -psm "$PSM" $SCAN $PAGE >/dev/null 2>&1
> cat "${PAGE}.txt" | cleantext >> "$TEXT"
> /usr/bin/beep -r $((2 - IDX % 2))
> test ! -z "$VERBOSE" && file "${TEXT}"
> OUTFILE="/home/john/phb5/PHB5.txt"
> echo "" > "$OUTFILE"
> for IDX in {1..999}; do
> 	TEXTFILE=`printf "/home/john/phb5/p%03d.txt" $IDX`
> 	if [ -f "$TEXTFILE" ]; then
> 		echo "Page $IDX" >>  "$OUTFILE"
> 		cat "$TEXTFILE">> "$OUTFILE"
> 		echo -e "\f" >> "$OUTFILE"
> 	fi
> done
> # EOF
> 
> On 11/02/2015 04:39 PM, Cheryl Homiak wrote:
>> Thanks much. No, the way to get into a turned-off computer far away hasn't been invented yet, unless you can turn it on by remote control somehow - :-)
>> I suspect the error was mine so I won't give up on it yet.
>> 
>> Thanks.
>> 
> 
> -- 
> John Heim, jheim@xxxxxxxxxxxxx, 608-263-4189, skype:john.g.heim, sip:jheim@xxxxxxxxxxxxxxxx
> _______________________________________________
> Speakup mailing list
> Speakup@xxxxxxxxxxxxxxxxx
> http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup
