Re: pdf trouble

I think you've done about all that can be done. As for getting math out of PDF, I wouldn't expect much. PDF tends to be presentational rather than structural. I would think
structural markup, such as MathML, would be preferable. But, as you've discovered, PDF is still widely used.
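
The "modi cations" example quoted below is a small illustration of the presentational problem: the page stores a single "fi" ligature glyph rather than the letters f and i. The following is only a rough sketch, assuming the extractor emits the Unicode ligature character (U+FB01) instead of dropping it. The function name and sample string are made up for illustration. If the converter has already replaced the glyph with a space, as the web tools apparently did, nothing like this can recover it.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility characters such as U+FB01 (fi ligature) into plain letters.

    NFKC normalization is a standard Unicode operation, not anything PDF-specific.
    """
    return unicodedata.normalize("NFKC", text)


# Hypothetical extractor output that still contains the ligature character.
extracted = "modi\ufb01cations"
print(expand_ligatures(extracted))  # prints "modifications"
```

One caution: NFKC also folds other compatibility characters, for example a superscript two becomes a plain "2", so it may do more harm than good on text that contains formulae.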

Christopher Brannon writes:
> From: Christopher Brannon <cbrannon@xxxxxxxxxxx>
> 
> Hi listers:
> 
> I am a college student, majoring in Computer Science.  I am now involved
> in a research project dealing with the subject of cryptography.
> And a lot of good material on the net is available only in a certain
> file format, pdf.  It doesn't convert well to text or html.  Therefore,
> it is not necessarily useful.  My question is, how do I turn PDF
> to html or text without loss of data?  I will list some of the solutions
> I've tried.
> 
> Firstly, there is pstotext.  This works well for most things, but it is
> a loss for some of the files I am starting to encounter.
> Next, there is the pdftotext utility from the xpdf distribution.  It works
> beautifully for most files.  I found this, and I now use it instead of pstotext.
> Still, it is sometimes a loss when the file contains mathematical formulae
> and other symbols.
> 
> There are the online conversion tools offered through access.adobe.com.
> And of course, if the file contains mathematical formulae, there may be
> lossage.
> There may even be lossage when the file contains English words.  For example,
> one PDF file contained the word "modifications".  The web-based tools produced
> "modi cations" as output.
> 
> All of the above solutions are good ones, and they work most of the time.
> I'm certainly not putting any of these products down, by any stretch.
> PDF is a complex file format.  Writing a translator for it is certainly no
> mean feat, I am sure.
> 
> There is one more solution, but it is certainly less than optimal.
> I've also used optical character recognition.  You can turn a pdf into
> a collection of .pnm bitmap image files using Ghostscript.  Then run the
> OCRShop utility from Vividata on the collection.
> 
> So those are the possibilities for converting pdf to text under Linux, as
> I see them.
> None is perfect, though some work well.
> What do I do?  Help!
> 
> BTW,  I notice that html is getting better and better.  The newest specs
> make it possible to represent all sorts of symbolry, mathematical and
> otherwise.  One of our list members, Karl Dahlke, has proven the concept
> with his math site (http://www.mathreference.com).
> So if this is the case, we should hopefully be seeing more material in html
> which previously might have been in pdf, shouldn't we?
> But the reverse seems to be the case.  I'm seeing a lot of pdf these days.
> Has it always been this way?  Maybe I'm just running into more pdf because
> I'm researching cryptography.
> Will it change in the foreseeable future?
> 
> 
> 
> _______________________________________________
> 
> Blinux-list@xxxxxxxxxx
> https://listman.redhat.com/mailman/listinfo/blinux-list
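
For anyone wanting to script the two command-line routes described above, here is a minimal sketch, assuming `pdftotext` (from xpdf) and `gs` (Ghostscript) are on the PATH. The function names, the `page-%03d.pnm` file pattern, and the 300 dpi default are my own choices for illustration. The OCR step is left out, since I don't want to guess at how OCRShop is invoked.

```python
import shutil
import subprocess
from typing import List, Optional


def pdftotext_command(pdf: str, keep_layout: bool = True) -> List[str]:
    """Build a pdftotext command that writes the text to standard output."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf, "-"]  # "-" as the output file means stdout
    return cmd


def ghostscript_pnm_command(pdf: str, out_dir: str, dpi: int = 300) -> List[str]:
    """Build a Ghostscript command that renders each page to a .pnm bitmap."""
    pattern = f"{out_dir}/page-%03d.pnm"  # Ghostscript fills in the page number
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pnmraw", f"-r{dpi}",
        f"-sOutputFile={pattern}", pdf,
    ]


def extract_text(pdf: str) -> Optional[str]:
    """Run pdftotext if it is installed; return None when it is not available."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `extract_text("paper.pdf")` would return the text, and the list from `ghostscript_pnm_command("paper.pdf", "pages")` can be passed to `subprocess.run` to produce the bitmaps for an OCR program. The output directory has to exist beforehand.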

-- 
	
				Janina Sajka, Director
				Technology Research and Development
				Governmental Relations Group
				American Foundation for the Blind (AFB)

Email: janina@xxxxxxx		Phone: (202) 408-8175


