Re: PDF to text?

On 12/08/11 18:25, Cameron Simpson wrote:
> On 12Aug2011 12:09, Bob Goodwin<bobgoodwin@xxxxxxxxxxxx>  wrote:
> | On 12/08/11 12:04, Genes MailLists wrote:
> |>  On 08/12/2011 11:58 AM, Bob Goodwin wrote:
> |>>  On 12/08/11 11:22, Genes MailLists wrote:
> |>>>  On 08/12/2011 11:16 AM, Madhav Ancha wrote:
> |>>>      You could try this fedora app:  pdftotext
> |>>>
> |>>           As can be seen I tried several combinations. I thought perhaps
> |>>           it couldn't handle the file name in quotes "Courier  etc", but
> |>>           nothing seems to do it?
> |>>
> |>     Is it possible the PDF contains an image of the text rather than the
> |>  text itself?
> |
> |         I'm not sure; how would I tell? It's an attachment to an HTML
> |         cover letter. The Fedora default app displays it with no
> |         complaints.
>
> Is it ridiculously large for the amount of text? Does it seem to have
> scanner artifacts in the text - "graininess" if you peer closely, fuzzy
> text instead of perfectly formed letters (i.e. a picture of text instead
> of text rendered by your computer from a font)?
>
> Personally I use pdftohtml to convert PDFs (then an HTML-to-text
> pipeline on the end of that). Possibly pdftotext does exactly that
> anyway. Of course it achieves nothing for me if the PDF is a scan.
>
> Cheers,

        It's a scan.
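
A rough way to answer the "is it a scan?" question without peering at the
pages is to look at how much text pdftotext actually extracts. The sketch
below is only a heuristic under stated assumptions: the helper names
(extract_text, looks_like_scan) and the 20-characters-per-page threshold are
made up for illustration, and it relies on pdftotext writing to stdout when
the output file is given as "-" and ending each page with a form feed.

```python
import subprocess


def extract_text(pdf_path):
    """Run pdftotext on pdf_path and return what it wrote to stdout.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def looks_like_scan(extracted_text, min_chars_per_page=20):
    """Guess whether a PDF is image-only from its pdftotext output.

    pdftotext ends each page with a form feed, so counting form feeds
    approximates the page count. The threshold is an arbitrary assumption.
    """
    pages = max(extracted_text.count("\f"), 1)
    # Count only visible characters; whitespace and form feeds do not count.
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible / pages < min_chars_per_page
```

Something like looks_like_scan(extract_text("Courier.pdf")) would then return
True for a file that yields next to no text (the file name here is only a
guess at the attachment's name). A scan that someone has already run through
OCR carries a text layer and would not be flagged.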

        pdftohtml seems to have produced JPEG as well as HTML files.

            -rw-rw-r--. 1 bobg bobg  321444 Aug 12 18:37 Courier-1_1.jpg
            -rw-rw-r--. 1 bobg bobg  309493 Aug 12 18:37 Courier-2_1.jpg
            -rw-rw-r--. 1 bobg bobg     461 Aug 12 18:37 Courier.html
            -rw-rw-r--. 1 bobg bobg     244 Aug 12 18:37 Courier_ind.html

        The HTML files display as a couple of boxes. The JPEGs are sharp
        reproductions of the text and can be converted to text with
        gocr, but the quality of that text leaves much to be desired. I
        might be able to work it over with a dictionary to fix the
        words that come out as gibberish.
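
The dictionary pass mentioned above could be sketched roughly as follows.
This is an illustration, not what was actually run: suggest_fixes is a
hypothetical helper, the 0.7 similarity cutoff is an arbitrary starting
point, and the word list is whatever you pass in. It flags every word in the
OCR output that is not in the dictionary and proposes the closest dictionary
words using difflib from the Python standard library.

```python
import difflib
import re


def suggest_fixes(ocr_text, dictionary, cutoff=0.7):
    """Map each unknown word in ocr_text to its closest dictionary words.

    Words already in the dictionary are skipped. An empty suggestion list
    means nothing in the dictionary was similar enough.
    """
    known = {w.lower() for w in dictionary}
    vocabulary = sorted(known)
    suggestions = {}
    for word in re.findall(r"[A-Za-z]+", ocr_text):
        lower = word.lower()
        if lower in known or lower in suggestions:
            continue
        # get_close_matches ranks candidates by SequenceMatcher ratio.
        suggestions[lower] = difflib.get_close_matches(
            lower, vocabulary, n=3, cutoff=cutoff
        )
    return suggestions


if __name__ == "__main__":
    words = ["the", "courier", "delivered", "parcel"]
    print(suggest_fixes("the couricr delivered the parccl", words))
```

A system word list such as /usr/share/dict/words, if one is installed, could
stand in for the tiny dictionary in the example. The suggestions still need
a human eye: badly mangled words get no match at all, and a near miss can
match the wrong word.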

        Thanks, I'll have a go at that later.

        Bob



