That's a good idea; I hadn't thought of that. I guess I should invest
some time in writing something like this.

On Sat, Dec 20, 2008 at 05:22:59AM -0800, marbux wrote:
> 2008/12/20 Daniel Dalton <d.dalton@xxxxxxxxxxxx>:
> > The images seem to
> > place random code in the doc (that's ok, some quick editing with emacs,
> > nano, vi or your favourite editor will fix that.
>
> I suggest creating a mental note to examine the randomness of the
> unwanted code and other mistakes on an ongoing basis. To the extent
> that it is repetitive between different documents with different
> characteristics, a clean-up script can be written to handle it, which
> might make a good community project if there is not already such a
> project associated with one of the apps.
>
> Seemingly random characters produced by the OCR process often have
> patterns that can be processed by a regex, e.g., an unusual Unicode
> special character in a "word." Reviewing source code for the document
> can point the way to, e.g., the symbol's Unicode number, which is a
> character entity written in plain text that can be processed by a
> script.
>
> Particular character combinations are also often handled poorly by OCR
> because their combination appears visually as very similar to another
> character. E.g., "rn" is often misread as "m." Throw in variation in
> typefaces and the quality of the source document, and you'll have the
> same errors occurring over and over again.
>
> Building a quality list of recurring "words" that are not words and
> their correct equivalents can also provide the input for an automagic
> substitution routine in the clean-up script.
>
> Many OCR errors result from the variability in typefaces. For
> frequently read publications like a newspaper, it can be helpful to
> build a "not-word" list adapted for the particular typeface used in
> the publication which can then be used for other publications that
> share the same or a very similar typeface.
> Involving sighted people who have good typeface recognition skills
> could be of assistance here. Often it is unnecessary to identify the
> particular typeface so long as it can be recognized as within a
> certain classification of typefaces.
>
> A database for typeface classifications used by particular frequently
> read publications can also play into the quality of clean-up scripts.
>
> Just some random thoughts from a sighted person who has struggled with
> OCR over the decades. I was a typographer in my first career.
>
> Best regards,
>
> Paul
>
> --
> Universal Interoperability Council
> <http:www.universal-interop-council.org>
>
> _______________________________________________
> Blinux-list mailing list
> Blinux-list@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/blinux-list
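
For what it's worth, the two clean-up ideas in the quoted message (a
regex for stray special characters inside a "word", and a "not-word"
substitution list) might look roughly like this in Python. This is only a
sketch: the junk characters, the NOT_WORDS entries and the function name
clean_ocr are all made up for illustration and do not come from any real
OCR run or existing project.

```python
import re

# Hypothetical set of stray symbols that an OCR engine might drop into the
# middle of a word (currency sign, section sign, pilcrow, dagger, double
# dagger). A real list would come from inspecting actual OCR output.
JUNK_IN_WORD = re.compile(r"(?<=\w)[\u00a4\u00a7\u00b6\u2020\u2021]+(?=\w)")

# Hypothetical "not-word" list: recurring OCR misreadings mapped to the
# intended word. Both entries illustrate "rn" being read as "m".
NOT_WORDS = {
    "govemment": "government",
    "intemet": "internet",
}

# One pattern that matches any listed not-word as a whole word.
NOT_WORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, NOT_WORDS)) + r")\b",
    re.IGNORECASE,
)


def _fix(match):
    """Return the correction, capitalised if the misreading started uppercase."""
    word = match.group(0)
    fixed = NOT_WORDS[word.lower()]
    return fixed.capitalize() if word[0].isupper() else fixed


def clean_ocr(text):
    """Strip stray symbols inside words, then apply the not-word list."""
    text = JUNK_IN_WORD.sub("", text)
    return NOT_WORD_RE.sub(_fix, text)


if __name__ == "__main__":
    print(clean_ocr("the govem\u2020ment intemet page"))
```

Only listed not-words are replaced, so a real word such as "modem" is
left alone even though it could be a misread "modern"; that case needs
context, not a lookup table. A separate NOT_WORDS table per typeface
class would fit the typeface idea above.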