pdf trouble

Christopher Brannon <cbrannon@xxxxxxxxxxx> · Sun, 30 Mar 2003 18:13:09 -0600

Hi listers:

I am a college student, majoring in Computer Science.  I am now involved
in a research project on cryptography.
A lot of good material on the net is available only in one file format,
PDF, which doesn't convert well to text or html.  Therefore, much of it
is not readily usable.  My question is, how do I turn PDF
to html or text without loss of data?  I will list some of the solutions
I've tried.

Firstly, there is pstotext.  This works well for most things, but it
fails on some of the files I am starting to encounter.
Next, there is the pdftotext utility from the xpdf distribution.  It works
beautifully for most files, and I now use it instead of pstotext.
Still, it sometimes loses data when the file contains mathematical formulae
and other symbols.
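
For anyone converting many files, the pdftotext step above can be wrapped in a short script.  This is only a minimal Python sketch: it assumes pdftotext from xpdf is on the PATH, and the helper names and the choice of the -layout flag (which asks pdftotext to keep the physical page layout) are my own illustrative assumptions.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext argument list (hypothetical helper name)."""
    pdf = Path(pdf_path)
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout keeps columns and tables closer to their original shape.
        cmd.append("-layout")
    # Write the text next to the PDF, with a .txt extension.
    cmd += [str(pdf), str(pdf.with_suffix(".txt"))]
    return cmd


def convert(pdf_path, keep_layout=True):
    """Run pdftotext and return the path of the text file it wrote."""
    cmd = pdftotext_command(pdf_path, keep_layout)
    # check=True raises CalledProcessError if pdftotext exits with an error.
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Calling convert("paper.pdf") should leave paper.txt beside the original.  It does nothing for formulae that pdftotext itself cannot extract.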

There are also the online conversion tools offered through access.adobe.com.
Here too, if the file contains mathematical formulae, there may be
lossage.
There may even be lossage when the file contains English words.  For example,
one PDF file contained the word "modifications".  The web-based tools produced
"modi cations" as output.
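
The "modi cations" output looks like a dropped "fi" ligature, though that is a guess about the cause and not something the tool reports.  Under that assumption, a post-processing pass can repair some of the damage.  The Python sketch below is hypothetical: the function names and the word list are made up for illustration, and the second function is a heuristic that only helps when a real vocabulary is supplied.

```python
import string
import unicodedata

# Longer ligatures first, so "ffi" is tried before "fi".
LIGATURES = ("ffi", "ffl", "fi", "fl", "ff")


def expand_ligatures(text):
    """Turn ligature code points such as U+FB01 into plain letters."""
    # NFKC compatibility normalization also rewrites other
    # compatibility characters, not only ligatures.
    return unicodedata.normalize("NFKC", text)


def repair_dropped_ligatures(text, vocabulary):
    """Rejoin word halves when inserting a ligature gives a known word."""
    words = text.split(" ")
    out = []
    i = 0
    while i < len(words):
        merged = None
        if i + 1 < len(words):
            for lig in LIGATURES:
                candidate = words[i] + lig + words[i + 1]
                # Ignore case and surrounding punctuation for the lookup.
                key = candidate.lower().strip(string.punctuation)
                if key in vocabulary:
                    merged = candidate
                    break
        if merged is not None:
            out.append(merged)
            i += 2  # both halves were consumed
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

With a vocabulary containing "modifications", the string "several modi cations were made" comes back as "several modifications were made".  Words missing from the vocabulary are left alone, so the result is only as good as the word list.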

All of the above solutions are good ones, and they work most of the time.
I'm certainly not putting any of these products down, by any stretch.
PDF is a complex file format.  Writing a translator for it is certainly no
mean feat, I am sure.

There is one more solution, but it is certainly less than optimal.
I've also used optical character recognition.  You can turn a pdf into
a collection of .pnm bitmap image files using Ghostscript.  Then run the
OCRShop utility from Vividata on the collection.
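
The Ghostscript half of that pipeline can be sketched in Python as follows.  The assumptions: gs is on the PATH, the pnmraw output device is compiled in, and 300 dpi is a reasonable resolution for OCR.  The helper names are hypothetical, and the OCR step is left out because I do not want to guess at the OCRShop command line.

```python
import subprocess


def ghostscript_command(pdf_path, prefix="page", dpi=300):
    """Build a gs command that renders each page to a .pnm file."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pnmraw",                  # raw PNM bitmaps
        f"-r{dpi}",                         # resolution in dots per inch
        f"-sOutputFile={prefix}-%03d.pnm",  # page-001.pnm, page-002.pnm, ...
        pdf_path,
    ]


def render_pages(pdf_path, prefix="page", dpi=300):
    """Run Ghostscript; raises CalledProcessError if gs fails."""
    subprocess.run(ghostscript_command(pdf_path, prefix, dpi), check=True)
```

The resulting page-NNN.pnm files can then be handed to whatever OCR program is available.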

So those are the possibilities for converting pdf to text under Linux, as
I see them.
None is perfect, though some work well.
What do I do?  Help!

BTW,  I notice that html is getting better and better.  The newest specs
make it possible to represent all sorts of symbolry, mathematical and
otherwise.  One of our list members, Karl Dahlke, has proven the concept
with his math site (http://www.mathreference.com).
So if this is the case, we should hopefully be seeing more material in html
which previously might have been in pdf, shouldn't we?
But the reverse seems to be the case.  I'm seeing a lot of pdf these days.
Has it always been this way?  Maybe I'm just running into more pdf because
I'm researching cryptography.
Will it change in the foreseeable future?

_______________________________________________

Blinux-list@xxxxxxxxxx
https://listman.redhat.com/mailman/listinfo/blinux-list