Re: PDF to text?

Joel Rees <joel.rees@xxxxxxxxx> · Wed, 17 Aug 2011 13:08:33 +0900

On Sat, Aug 13, 2011 at 2:41 AM, Bob Goodwin <bobgoodwin@xxxxxxxxxxxx> wrote:
> On 12/08/11 12:22, mike cloaked wrote:
[...]
>> However if the pdf is a scanned image then it would need ocr before
>> the text could be extracted -

As someone else noted, some recent scan-to-pdf tools try to pre-ocr
the text. Sometimes it's sort of helpful. Sometimes not so much.

Some pdf output tools actually embed the real text in the pdf alongside
an image of the text. But that's not scanning. That doesn't seem to be
the case here, either.
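
For what it's worth, one rough way to check whether a pdf carries any
embedded text at all is to run it through pdftotext (from poppler-utils)
and see whether anything meaningful comes out. A minimal sketch, assuming
pdftotext is installed and on the PATH; the function names and the
20-character threshold are my own, picked only for illustration:

```python
import subprocess


def looks_like_text(extracted, min_chars=20):
    """Heuristic: call it a real text layer if the extracted output has
    at least min_chars letters or digits. The threshold is arbitrary."""
    return sum(ch.isalnum() for ch in extracted) >= min_chars


def pdf_has_text_layer(path, min_chars=20):
    """Run pdftotext on the file and apply the heuristic to its output."""
    # "pdftotext FILE -" writes the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return looks_like_text(result.stdout, min_chars)
```

If pdf_has_text_layer() comes back False, the file is probably a bare
scan and would need ocr first. True only means some text is in there; if
a scan-to-pdf tool pre-ocr'd it, that text may still be garbage.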

>        I believe it is a scanned image now that I realize it has a
>        handwritten signature.
>
>        Xsane does ocr. I tried scanning a printed copy and letting
>        xsane save it as a text message as well as trying gocr to read
>        an xsane .pnm file. Both produced the same output which looks
>        like it would require a lot of work to be usable if it is
>        possible at all?
>
>        I will do without the Google translation.
>
>        Thanks for all the suggestions. This has been interesting, I
>        always wondered about ocr, what it could do. I need to
>        experiment with a document in English so that I have something I
>        understand however it looks like the output quality is poor?

ocr is still hit-and-miss. Some combinations of
languages/fonts/scanners/image formats/paper quality/ocr software and
the price of 10base5 cable on Saipan work well. Others don't.

Well, probably not 10base5. :/

But the tuning is sometimes so time-intensive that you'd prefer to
just type it in by hand. On the other hand, if you have a lot of
scanned text that comes from the same source, the tuning can be worth
it.

Don't ask me how to tune the ocr. Some years ago I read up on it and
decided, for that doc, I'd pass. Open source ocr seems to have
progressed since then, which is nice.

Joel Rees