RE: PDF to Text

"Richard Lynch" <ceo@xxxxxxxxx> · Thu, 20 Apr 2006 21:23:50 -0500 (CDT)

On Thu, April 20, 2006 8:59 pm, Jay Blanchard wrote:
> [snip]
>> I am trying to find a way for a program to search through the text on a
>> PDF. My first thought was to use pdftotext, but the PDFs generated by our
>> commercial scanner/copier/printer machine do not seem to work with
>> pdftotext... it just outputs two CRLFs.  I've been looking around on the
>> net for something similar that might work.
>>
>> Anyone know of something like that?
>>
>> Thanks,
>> --
>> Ray Hauge
>
> Things I forgot to post:
>
> It is a PHP script.  I was planning on using shell_exec() to call the
> program and read the output from stdout.
> [/snip]
>
> Sounds like the PDFs are images and therefore will not be readable by
> anything, save for eyeballs. I have run into this quite a bit. The
> scanner scans the doc via a TWAIN driver, which then converts the info
> into an image of that which was scanned. It would be like trying to read
> text programmatically from a JPEG... not really possible.

Actually, it's "possible", just bloody difficult.

You're looking into a topic known as OCR (Optical Character Recognition).

One open source project for this is:
GOCR (aka JOCR)
It's GOCR on freshmeat and JOCR on sourceforge because the name they
wanted was "taken" by another project. :-(

A commercial product known as OmniPage is probably the "best"
solution, unfortunately.

Some interesting options.

I've been thinking of maybe writing a 'real' OCR extension to PHP,
and GOCR/JOCR is one of the candidates I'd consider...

You also could, theoretically, convert the PDF to an image of some
kind, pull it into GD, and then roll your own package based around:
http://php.net/imagecolorat
-- along with a zillion lines of code to reduce noise, detect edges,
and compute "distance" between two glyphs...
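
Very roughly, the imagecolorat part might look like the sketch below.
It assumes the GD extension is loaded and that you have already cropped
each glyph into its own image, which is the hard part and is not shown.
The function names glyph_bitmap() and glyph_distance() are made up for
illustration, and "distance" here is just a count of differing pixels,
nowhere near what a real OCR engine does.

```php
<?php
// Sketch only: assumes GD is available and every glyph has already been
// cropped to its own image of the same size (segmentation is not shown).

// Turn a GD image into rows of 0/1: 1 = dark ("ink"), 0 = light (paper).
function glyph_bitmap($img, $threshold = 128)
{
    $w = imagesx($img);
    $h = imagesy($img);
    $rows = array();
    for ($y = 0; $y < $h; $y++) {
        $row = array();
        for ($x = 0; $x < $w; $x++) {
            // imagecolorat() gives a palette index or a packed RGB value;
            // imagecolorsforindex() decodes either into red/green/blue.
            $c = imagecolorsforindex($img, imagecolorat($img, $x, $y));
            $gray = ($c['red'] + $c['green'] + $c['blue']) / 3;
            $row[] = ($gray < $threshold) ? 1 : 0;
        }
        $rows[] = $row;
    }
    return $rows;
}

// Crude "distance": number of pixels where two same-sized bitmaps differ.
function glyph_distance($a, $b)
{
    $diff = 0;
    foreach ($a as $y => $row) {
        foreach ($row as $x => $bit) {
            if ($bit != $b[$y][$x]) {
                $diff++;
            }
        }
    }
    return $diff;
}
```

You would compare each unknown glyph against a set of known template
bitmaps and take the one with the smallest distance. Noise reduction
and edge detection would have to happen before any of this.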

I did something like this on a very very very small and limited scale
recently, but it's not code I can publish nor is it truly useful to
you anyway.

Your best bet at this point is to search for "PDF OCR" and/or "PDF to
image" and then "OCR" separately and hope to find two packages
together that will suit your needs.
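
Since you were already planning on shell_exec(), gluing two such
packages together could look something like the sketch below. It
assumes pdfimages (from xpdf) for the "PDF to image" step and gocr for
the "OCR" step, both installed and on the PATH. That pairing is only
one possibility, I have not tried it on your scanner's PDFs, and the
function names are made up for illustration.

```php
<?php
// Sketch only: assumes pdfimages (xpdf) and gocr are installed and on the
// PATH. Swap in whatever "PDF to image" and "OCR" packages you settle on.

// Command that dumps the images embedded in a PDF.
// pdfimages writes files named <prefix>-000.pbm, <prefix>-001.ppm, ...
function pdf_to_image_cmd($pdf, $prefix)
{
    return 'pdfimages ' . escapeshellarg($pdf) . ' ' . escapeshellarg($prefix);
}

// Command that OCRs one image; gocr prints the recognized text on stdout.
function ocr_cmd($image)
{
    return 'gocr -i ' . escapeshellarg($image);
}

// Run both steps and return the recognized text of all extracted images.
function pdf_ocr($pdf, $workdir)
{
    $prefix = rtrim($workdir, '/') . '/page';
    shell_exec(pdf_to_image_cmd($pdf, $prefix));

    $images = glob($prefix . '-*');
    if (!$images) {
        return '';  // nothing extracted, or glob() failed
    }
    sort($images);  // keep the pages in order

    $text = '';
    foreach ($images as $image) {
        $text .= shell_exec(ocr_cmd($image));
    }
    return $text;
}
```

Something like pdf_ocr('/path/to/scan.pdf', '/tmp/ocr') would then
give you text to search, errors and all.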

Note that OCR is, at best, only going to correctly convert ~95% of the
PDF into text.

If you need error-free conversion, forget software automation and do
it by hand, or count on a human intervention step in the process to
correct the transcription, because you will NOT get 100% from OCR.

Even ~9x% assumes good, clean images, and a lot of image-quality
factors can lower that drastically.
