Request for [API CHANGE] in spell checking: add new options to disable rule-based compounding

Hi,

I've started to add two new spell checking options to css.linguistic2.XLinguProperties (screenshot: https://wiki.documentfoundation.org/images/a/a2/Spelling_options_compound.png), which can improve spell checking a lot. Because API changes need more attention, please check the rationale below and comment on the extension, the captions of the check boxes, and the patch itself, especially on whether backwards compatibility is accidentally broken (I don't know whether it is). If it's OK for you, my plan is to extend the help (and to follow up on the other Hunspell problems, e.g. overly redundant suggestions in several cases).

From the commit description:
“For professional proofreaders, it can be more important to avoid the mistakes of the rule-based compound word recognition, than to speed up proofreading.

Disabling the following two new options will report all rule-based closed compound words (default in Dutch, German, Hungarian etc. dictionaries) and rule-based hyphenated compound words (all languages with BREAK usage in their Hunspell dictionaries):

- "Accept possible closed compound words"
- "Accept possible hyphenated compound words"

For example, disabling the second one, dictionary word "scot-free" will be still correct word in English spell checking, but not the previously accepted compound "arbitrary-word-with-hyphen".”
Commit: https://git.libreoffice.org/core/+/57d79744c77eef96b4c2bd3b16e0a04317ffcf9e%5E%21
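
To make the effect of the second option concrete, here is a minimal Python sketch. It is not Hunspell or LibreOffice code: the function name `accepts_hyphenated`, its parameter and the tiny word list are hypothetical, and BREAK handling is reduced to splitting at hyphens.

```python
def accepts_hyphenated(word, dictionary, accept_hyphenated_compounds=True):
    """Toy spell check for the "Accept possible hyphenated compound words" option.

    Hypothetical simplification: Hunspell's BREAK handling is modelled
    as a plain split at every hyphen.
    """
    # A word listed in the dictionary is always correct, hyphen or not.
    if word in dictionary:
        return True
    # Rule-based part: accept if every hyphen-separated piece is a dictionary word.
    if accept_hyphenated_compounds and "-" in word:
        return all(part in dictionary for part in word.split("-"))
    return False


# Hypothetical miniature dictionary, for illustration only.
DICTIONARY = {"scot-free", "arbitrary", "word", "with", "hyphen"}

for option in (True, False):
    for candidate in ("scot-free", "arbitrary-word-with-hyphen"):
        print(option, candidate, accepts_hyphenated(candidate, DICTIONARY, option))
```

With the option disabled, only the dictionary entry "scot-free" is still accepted. The generated compound is reported, which is the behaviour the commit message describes.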

Rationale:

The spell checkers of MS Office and Google Docs have started to use "common knowledge" by collecting words and user feedback from the internet. This is cheap and up-to-date, and likely good enough for writing private messages, but it is not suitable for professional document editing (see for example the user feedback on Word, “new version of the spell checker is awful”: https://answers.microsoft.com/en-us/msoffice/forum/all/spell-check-problems/10078dbf-855a-4154-afb4-fac5e5c24ad8). Several languages, such as Dutch, French, German and Hungarian, use an academic approach, i.e. an orthography standardized by the government or national bodies; see for example the official status of Duden in Germany. A spell checker that accepts spelling mistakes because users make them frequently is the opposite of a spell checker, at least in a document editor. Thanks to the lazy approach of the other document editors, the spell checker of Writer can be more attractive to professionals than before.

Hunspell and the Hunspell dictionaries are not perfect either. Editors have long requested an option to disable the rule-based compound words. The rule-based approach eliminated false alarms successfully (note: German-like orthography generates millions of “single-use” correct word forms, which are impossible to list in a spelling dictionary), but it also caused spell checking to malfunction: typos and missing spaces between words are frequently skipped by the spell checker. Hunspell already has a successful solution that limits this in the most important cases: if the possible rule-based compound word is also a dictionary word with a serious spelling mistake, the word form is reported as a spelling mistake (see REP and CHECKCOMPOUNDREP in https://github.com/hunspell/hunspell/blob/master/man/hunspell.5). The new Hunspell 1.7.2 added a similar feature for rule-based compound words composed of 3 or more words (https://github.com/hunspell/hunspell/commit/ff3591b0f76950f13d73123d03a03edd9a892945).
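
The REP-based safeguard and the proposed closed-compound option can be sketched together in a few lines of Python. This is a toy model under stated assumptions, not Hunspell's implementation: every function name, the two-part compound rule, the word list and the single replacement pair are hypothetical stand-ins for what a real .aff/.dic pair would define.

```python
def splits_into_two_words(word, dictionary):
    """Toy compound rule: the word is a concatenation of two dictionary words."""
    return any(word[:i] in dictionary and word[i:] in dictionary
               for i in range(1, len(word)))


def is_rep_typo_of_dictionary_word(word, dictionary, rep_table):
    """True if one REP-style replacement turns the word into a dictionary word."""
    for wrong, right in rep_table:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in dictionary:
                return True
            start = word.find(wrong, start + 1)
    return False


def accepts_closed(word, dictionary, rep_table, accept_closed_compounds=True):
    """Toy check combining the CHECKCOMPOUNDREP idea with the new option."""
    if word in dictionary:
        return True
    if not accept_closed_compounds:
        return False  # strict, dictionary-only checking
    if not splits_into_two_words(word, dictionary):
        return False
    # Reject a generated compound that looks like a typo of a dictionary word.
    return not is_rep_typo_of_dictionary_word(word, dictionary, rep_table)


# Hypothetical Hungarian-like sample data, for illustration only.
DICTIONARY = {"szer", "víz", "szerviz", "kocsi"}
REP_TABLE = [("í", "i")]

for candidate in ("szerviz", "szervíz", "vízszer"):
    print(candidate,
          accepts_closed(candidate, DICTIONARY, REP_TABLE, True),
          accepts_closed(candidate, DICTIONARY, REP_TABLE, False))
```

In this sketch, "szervíz" is rejected even with compounding enabled because it is one replacement away from the dictionary word "szerviz". The generated compound "vízszer" passes unless the option is disabled.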
But these safeguards are not enough: other typos are still accepted as compound words by the rule-based compounding. The new options are not entirely new in the case of Hungarian: the Lightproof spell checker already contains the options “Underline all typo-like compound words” and “Underline all generated compound words”. This feature is important enough to be available for all languages with the same potential problem. An editor who wants more realistic, i.e. strict dictionary-based, spell checking can disable these new options and, with some effort, fix the typos and the missing spaces without reading all 300 pages of a book (and it helps otherwise, too: reading the book does not guarantee that you will spot the typos).

Best regards,
László
