Re: Scanning text of PDF documents

Frank Arensmeier <frank@xxxxxxxxxxxx> · Thu, 15 May 2008 11:40:56 +0200

A reliable solution depends partly on the PDF document itself.
Consider whether your PDF contains rotated text or text that spans
several different blocks or pages. My experience with ps2ascii and
other Ghostscript-related tools is that sometimes they work quite
well, and sometimes the output is rather messy.
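
As a rough illustration of the Ghostscript route, here is a minimal sketch in Python that shells out to ps2ascii and tidies the whitespace of whatever comes back. It assumes Ghostscript is installed and ps2ascii is on the PATH; the helper names (build_command, extract_text, normalize) and the file name in the usage note are made up for this example. From PHP, the same command could presumably be run with shell_exec().

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path):
    """Return the ps2ascii argument list for one PDF file."""
    # With only an input file given, ps2ascii writes the text to stdout.
    return ["ps2ascii", str(pdf_path)]


def extract_text(pdf_path):
    """Run ps2ascii on pdf_path and return whatever text it prints."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"no such PDF: {path}")
    if shutil.which("ps2ascii") is None:
        raise RuntimeError("ps2ascii (part of Ghostscript) is not on PATH")
    result = subprocess.run(
        build_command(path),
        capture_output=True,
        text=True,
        errors="replace",  # do not fail on bytes that cannot be decoded
        check=True,
    )
    return result.stdout


def normalize(text):
    """Collapse runs of whitespace, since the raw output is often ragged."""
    return " ".join(text.split())
```

Calling normalize(extract_text("upload.pdf")) would give one cleaned-up string that could then be stored in the DB. How usable that string is still depends on the PDF, for the reasons above.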

The most reliable way of extracting text from a PDF is (I think) a
product called PDF TET from PDFlib GmbH. Yes, a license costs some
money, but you can then get almost everything out of the PDF.

http://www.pdflib.com/products/tet/

Maybe some magic with OpenOffice could do the trick as well?

//frank

On 15 May 2008 at 10:19, Angelo Zanetti wrote:

Hi All.

This is a quick question.

A client of ours wants a solution where, when a PDF document is
uploaded, we use PHP to scan the document's contents and save them in
a DB.

I know you can do this with normal text documents using the file
commands and functions.

Is it possible with PDF documents?

My feeling is NO, but perhaps someone will prove me wrong.

Thanks in advance.

Angelo

Web: http://www.elemental.co.za

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Frank Arensmeier
........................................................................
Webmaster & IT Development

NIKE Hydraulics AB
Box 1107
631 80 Eskilstuna
Sweden

phone +46 - (0)16 16 82 34
fax +46 - (0)16 13 93 16
frank@xxxxxxxxxxxx
www.nikehydraulics.se
........................................................................