Re: Extracting ASCII text from a PDF Document

Martin McCormick wrote:
> I have a PDF document that does have embedded ASCII text in it.
> 	I need to use the file on a Debian system so I hope I am
> just using a2ps and pstotext wrong.

Don't do that!  Use pdftotext instead.
On my distribution, Arch Linux, pdftotext is provided by the "poppler"
package.  I don't know for certain which package you need on Debian.
It is probably poppler-utils, or perhaps xpdf.
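
If you would rather drive pdftotext from a script, here is a minimal sketch.
It assumes pdftotext is installed and on your PATH.  The function names and
the file name "document.pdf" are made up for illustration.  Passing "-" as
the output file asks pdftotext to write to standard output.

```python
import subprocess


def build_command(pdf_path):
    # "-" as the output name tells pdftotext to write to standard output.
    return ["pdftotext", pdf_path, "-"]


def extract_text(pdf_path):
    # Run pdftotext and return whatever it printed, decoded as UTF-8.
    # check=True raises CalledProcessError if pdftotext exits with an error.
    result = subprocess.run(
        build_command(pdf_path),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # "document.pdf" is a placeholder; substitute your own file.
    print(extract_text("document.pdf"))
```

From a shell, the equivalent is simply "pdftotext document.pdf -".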

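One thing you'll notice when converting PDF to plain text is that certain
letter combinations come out as single UTF-8-encoded Unicode characters.
These are typographic ligatures: the PDF's font drew the pair as one glyph,
and the converter passes that glyph through as one character.
Common examples are fi (U+FB01), fl (U+FB02), and ff (U+FB00).
Of course, most screen readers won't render those correctly.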
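
One way to work around this is to put the plain letters back after the
conversion.  The sketch below is only an illustration, not something
pdftotext does for you.  The table and function name are my own, and the
table covers only the five Latin ligatures from the Unicode "Alphabetic
Presentation Forms" block that involve f.

```python
# Map each ligature code point to the plain ASCII letters it stands for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    # Replace every ligature character with its separate letters;
    # all other characters are left untouched.
    return text.translate(_TABLE)


if __name__ == "__main__":
    import sys

    # Use as a filter: pdftotext document.pdf - | python3 this_script.py
    for line in sys.stdin:
        sys.stdout.write(expand_ligatures(line))
```

Python's unicodedata.normalize("NFKC", text) also expands these characters,
but it rewrites many other characters as well, so the explicit table is the
more conservative choice.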

-- Chris
