Re: command line scanned pdf to text

Here is the complete script. Sorry I forgot to post it last night. I turned the machine on as I left this morning and sshed into it from work. There's some junk in here you may or may not be interested in.

You can pass the script 2 parameters. #1 is the page number. It uses this number to make the output text file name. Page 99 would be named p099.txt. If you don't pass it a page number, it looks for files matching the same pattern and takes the first number that isn't used yet. So if p001.txt through p099.txt already exist, it would create a p100.txt.

The second parameter is the tesseract psm (page segmentation mode) flag. The tesseract man page explains these. If you don't pass one, the script uses 3.

After it's done with the scan and OCR, it concatenates all the pages into one big file. It also beeps when it finishes a page: once after an odd-numbered page and twice after an even-numbered one. The double beep is to remind me to turn the page. Otherwise I sometimes forget if I've already done both sides.
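
To make the file naming and the beep rule concrete, here is a small stand-alone sketch of just those two pieces. The function names next_page and beep_count and the directory argument are made up for illustration; the script below hard-codes the directory and does both steps inline.

```bash
#!/bin/bash
# Illustrative only: next_page and beep_count are hypothetical helper names.

# Print the lowest page number (1..999) that has no pNNN.txt in directory $1.
next_page() {
	local dir=$1 n
	for n in {1..999}; do
		[ -f "$(printf '%s/p%03d.txt' "$dir" "$n")" ] || { echo "$n"; return 0; }
	done
	return 1	# every number from 1 to 999 is already taken
}

# Print how many beeps follow page $1: 1 for odd pages, 2 for even pages.
beep_count() {
	echo $((2 - $1 % 2))
}

# Small demonstration in a throwaway directory.
demo=$(mktemp -d)
touch "$demo/p001.txt" "$demo/p002.txt"
echo "next page: $(next_page "$demo")"
echo "beeps after page 2: $(beep_count 2)"
echo "beeps after page 3: $(beep_count 3)"
rm -r "$demo"
```

With p001.txt and p002.txt present, this reports page 3 as the next one, two beeps after page 2, and one beep after page 3.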


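#!/bin/bash
# Scan one page, OCR it with tesseract, and rebuild the combined text file.
# Parameters: $1 = page number (optional), $2 = tesseract psm (optional).

# Page number: use $1 if given, otherwise the first number with no pNNN.txt yet.
IDX=$1
if [ -z "$IDX" ]; then
	for IDX in {1..999}; do
		TEXT=$(printf "/home/john/phb5/p%03d.txt" "$IDX")
		test ! -f "${TEXT}" && break
	done
fi
TEXT=$(printf "/home/john/phb5/p%03d.txt" "$IDX")
test -n "$VERBOSE" && echo "Working on page $IDX, ${TEXT} ... "

# Tesseract page segmentation mode: $2 if given, otherwise 3.
PSM="$2"
test -z "$PSM" && PSM=3

# Scan the page to a temporary TIFF.
RESOLUTION=600
SCAN=/tmp/page.tif
scanimage --format=tiff --mode Lineart --resolution $RESOLUTION > $SCAN

# OCR the scan; tesseract writes its text to ${PAGE}.txt.
# With this redirection order stdout is discarded and stderr still reaches the terminal.
PAGE=/tmp/page
tesseract -psm "$PSM" $SCAN $PAGE 2>&1 >/dev/null
cat "${PAGE}.txt" >> "$TEXT"

# One beep after an odd page, two after an even page.
/usr/bin/beep -r $((2 - IDX % 2))
test -n "$VERBOSE" && file "${TEXT}"

# Rebuild the combined file from every page that exists so far.
OUTFILE="/home/john/phb5/PHB5.txt"
echo "" > "$OUTFILE"
for IDX in {1..999}; do
	TEXTFILE=$(printf "/home/john/phb5/p%03d.txt" "$IDX")
	if [ -f "$TEXTFILE" ]; then
		echo "Page $IDX" >> "$OUTFILE"
		cat "$TEXTFILE" >> "$OUTFILE"
		echo -e "\f" >> "$OUTFILE"
	fi
done
# EOF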
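
One thing to watch when passing the page number by hand: bash reads a number with a leading zero as octal, so an argument like 08 or 099 makes both printf %03d and the $((...)) arithmetic complain. A minimal sketch of one way to guard against that, assuming you want 099 to mean page 99. The helper name to_decimal is hypothetical and not part of the script above.

```bash
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Force base 10 so that 08 or 099 is not rejected as an invalid octal number.
to_decimal() {
	echo $((10#$1))
}

IDX=$(to_decimal "${1:-099}")	# the 099 default is only for this demonstration
printf 'p%03d.txt\n' "$IDX"
```

In the script this would amount to normalizing IDX right after it is read from $1, before it is used in printf or in the beep arithmetic.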


On 11/02/2015 04:39 PM, Cheryl Homiak wrote:
Thanks much. No, the way to get into a turned-off computer far away hasn't been invented yet, unless you can turn it on by remote control somehow - :-)
I suspect the error was mine so I won't give up on it yet.

Thanks.


--
John Heim, jheim@xxxxxxxxxxxxx, 608-263-4189, skype:john.g.heim, sip:jheim@xxxxxxxxxxxxxxxx
_______________________________________________
Speakup mailing list
Speakup@xxxxxxxxxxxxxxxxx
http://linux-speakup.org/cgi-bin/mailman/listinfo/speakup



