I think you've done about what can be done.

As for math out of PDF, I wouldn't expect much. PDF tends to be
presentational, not structural. I would think structural markup, as
with MathML, would be preferable. But, as you've discovered, PDF is
still widely used.

Christopher Brannon writes:
> From: Christopher Brannon <cbrannon@xxxxxxxxxxx>
>
> Hi listers:
>
> I am a college student, majoring in Computer Science. I am now involved
> in a research project dealing with the subject of cryptography.
> And a lot of good material on the net is available only in a certain
> file format, pdf. It doesn't convert well to text or html. Therefore,
> it is not necessarily useful. My question is, how do I turn PDF
> to html or text without loss of data? I will list some of the solutions
> I've tried.
>
> Firstly, there is pstotext. This works well for most things, but it is
> a loss for some of the files I am starting to encounter.
> Next, there is the pdftotext utility from the xpdf distribution. It works
> beautifully for most files. I found this, and I now use it instead of pstotext.
> Still, it is sometimes a loss when the file contains mathematical formulae
> and other symbols.
>
> There are the online conversion tools offered through access.adobe.com.
> And of course, if the file contains mathematical formulae, there may be
> lossage.
> There may even be lossage when the file contains English words. For example,
> one PDF file contained the word "modifications". The web-based tools produced
> "modi cations" as output.
>
> All of the above solutions are good ones, and they work most of the time.
> I'm certainly not putting any of these products down, by any stretch.
> PDF is a complex file format. Writing a translator for it is certainly no
> mean feat, I am sure.
>
> There is one more solution, but it is certainly less than optimal.
> I've also used optical character recognition. You can turn a pdf into
> a collection of .pnm bitmap image files using Ghostscript. Then run the
> OCRShop utility from Vividata on the collection.
>
> So those are the possibilities for converting pdf to text under Linux, as
> I see them.
> None is perfect, though some work well.
> What do I do? Help!
>
> BTW, I notice that html is getting better and better. The newest specs
> make it possible to represent all sorts of symbolry, mathematical and
> otherwise. One of our list members, Karl Dahlke, has proven the concept
> with his math site (http://www.mathreference.com).
> So if this is the case, we should hopefully be seeing more material in html
> which previously might have been in pdf, shouldn't we?
> But the reverse seems to be the case. I'm seeing a lot of pdf these days.
> Has it always been this way? Maybe I'm just running into more pdf because
> I'm researching cryptography.
> Will it change in the foreseeable future?
>
>
>
> _______________________________________________
>
> Blinux-list@xxxxxxxxxx
> https://listman.redhat.com/mailman/listinfo/blinux-list

-- 
Janina Sajka, Director
Technology Research and Development
Governmental Relations Group
American Foundation for the Blind (AFB)

Email: janina@xxxxxxx Phone: (202) 408-8175