2008/12/20 Daniel Dalton <d.dalton@xxxxxxxxxxxx>:
> The images seem to place random code in the doc (that's ok, some quick
> editing with emacs, nano, vi or your favourite editor will fix that.

I suggest keeping a running mental note of how random the unwanted code and other mistakes really are. To the extent that the errors repeat across documents with different characteristics, a clean-up script can be written to handle them. That might make a good community project, if there is not already such a project associated with one of the apps.

Seemingly random characters produced by the OCR process often have patterns that a regex can match, e.g., an unusual Unicode special character in a "word." Reviewing the source code for the document can lead you to, e.g., the symbol's Unicode number, which is a character entity written in plain text that a script can process.

Particular character combinations are also often handled poorly by OCR because together they look very similar to another character. E.g., "rn" is often misread as "m." Throw in variation in typefaces and in the quality of the source document, and you'll have the same errors occurring over and over again. Building a quality list of recurring "words" that are not words, along with their correct equivalents, can also provide the input for an automagic substitution routine in the clean-up script.

Many OCR errors result from the variability in typefaces. For a frequently read publication like a newspaper, it can be helpful to build a "not-word" list adapted to the particular typeface that publication uses. The list can then be reused for other publications that share the same or a very similar typeface. Involving sighted people who have good typeface recognition skills could be of assistance here. Often it is unnecessary to identify the particular typeface, so long as it can be recognized as falling within a certain classification of typefaces.
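
The regex and "not-word" list ideas above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the `NOT_WORDS` entries are made up, the function names are hypothetical, and treating any non-ASCII symbol or control character inside a word as suspicious is a guess that suits plain English text, not a rule taken from any OCR app.

```python
import re
import unicodedata

# Hypothetical "not-word" list: recurring OCR misreadings and their corrections.
# The entries are made up for illustration ("rn" misread as "m", "h" as "b").
NOT_WORDS = {
    "govemment": "government",
    "retum": "return",
    "tbe": "the",
}

# One alternation that matches any listed not-word as a whole word.
NOT_WORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in NOT_WORDS) + r")\b"
)


def fix_not_words(text):
    """Replace each listed not-word with its correction (case-sensitive)."""
    return NOT_WORD_RE.sub(lambda match: NOT_WORDS[match.group(0)], text)


def is_odd(ch):
    """True for non-ASCII symbol or control/format characters (an assumption)."""
    return ord(ch) > 127 and unicodedata.category(ch)[0] in ("S", "C")


def suspicious_tokens(text):
    """Return (token, code points) pairs for words containing odd characters."""
    report = []
    for token in text.split():
        odd = [f"U+{ord(ch):04X}" for ch in token if is_odd(ch)]
        if odd and any(ch.isalpha() for ch in token):
            report.append((token, odd))
    return report


if __name__ == "__main__":
    sample = "The govemment will retum tbe str©ange form."
    print(fix_not_words(sample))
    print(suspicious_tokens(sample))
```

The substitution step only touches whole words that appear in the list, so a separate list per typeface classification could be swapped in without changing the code. The report of code points is meant for a human to review before any new rule is added to the script.
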
A database of the typeface classifications used by particular frequently read publications can also contribute to the quality of clean-up scripts.

Just some random thoughts from a sighted person who has struggled with OCR over the decades. I was a typographer in my first career.

Best regards,

Paul

--
Universal Interoperability Council
<http:www.universal-interop-council.org>

_______________________________________________
Blinux-list mailing list
Blinux-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/blinux-list