On Sun, Oct 02, 2011 at 05:45:48PM +0200, Reuven M. Lerner wrote:
> quite grateful for that. (I really hadn't ever needed to deal with
> such issues in the past, having worked mostly with English and
> Hebrew, which don't have such accent marks.)

That isn't quite true about English. We have words like coöperate and naïve. The former is sometimes fixed with a hyphen instead, but the latter can't be.

I think what happened is that English speakers, because we're already used to being sloppy (you can't tell what's a subjunctive in English, either, just by looking), were willing to adapt our spelling to reflect the limitations of typewriters. Also, English never really had an official standard spelling -- by the time the English were attempting to standardize seriously, there was already an American branch with its own Bossypants Official Reformer of Spelling ("BORS", which in that case was Noah Webster. See G.B. Shaw for a British example). So we mostly lost the accents in standard spelling. We also lost various standard digraphs, like that in encyclopædia (which, depending on which branch of nonsense you subscribe to, can be spelled instead "encyclopedia" or "encyclopaedia"; both would have been called "wrong" once upon a time).

> As for the unaccent dictionary, I hadn't heard of it before, but
> just saw it now in contrib, and it looks like it might fit
> perfectly. I'll take a look; thanks for the suggestion.

The big problem there is what someone else pointed to up-thread: in some languages, the natural thing to do is to transliterate using multiple characters. The usual example is that in German it is common to use "e" after a vowel to approximate the umlaut. So, "ö" becomes "oe". Unfortunately, in Swedish this is clearly a mistake, and if you can't use the diaeresis, then you just use the "undecorated" character instead. The famous Swedish ship called the Götheborg cannot be transliterated as Goetheborg.
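
To make the difference concrete, here is a minimal Python sketch of language-dependent transliteration. It is only an illustration, not how the unaccent dictionary works: the names MULTI_CHAR_RULES, strip_marks and transliterate are made up for this example, the rule table covers only the German umlauts, and it assumes every string arrives together with a language code such as "de" or "sv-SE".

```python
import unicodedata

# Hypothetical per-language rules, keyed by the primary language code.
# Only the German umlaut convention is listed; a real table would need
# far more languages and characters.
MULTI_CHAR_RULES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"},
}


def strip_marks(text):
    """Plain accent stripping: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate(text, lang):
    """Apply the language's multi-character rules, then strip what is left."""
    primary = lang.split("-")[0].lower()   # "sv-SE" -> "sv"
    rules = MULTI_CHAR_RULES.get(primary, {})
    # Compose first so that "o" + combining diaeresis matches the "ö" key.
    composed = unicodedata.normalize("NFC", text)
    replaced = "".join(rules.get(ch, ch) for ch in composed)
    return strip_marks(replaced)


print(transliterate("schön", "de"))      # schoen
print(transliterate("Götheborg", "sv"))  # Gotheborg
print(transliterate("naïve", "en"))      # naive
```

Any language without an entry in the table falls back to plain stripping, which is the Swedish behaviour described above. The sketch only goes in one direction; it makes no attempt to turn "oe" back into "ö".
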
Even in German, the rule is complicated, because it's not two-way: you can't spell the famous writer's name Göthe (even though Google seems to think you can).

As far as I can tell, the unaccent dictionary doesn't handle the two-character case, though it sure looks like it could be extended to do it. But it doesn't seem to have a facility for differentiating based on the language of the string. I don't know whether that could be added.

The upshot is that, if you need to store multilingual input and do special handling on the strings afterwards, you are wise to store the string with a language tag so that you can apply the right rules later on. See RFC 5646 (http://www.rfc-editor.org/rfc/rfc5646.txt) for some pointers.

If just "stripping accents" is good enough for you, then the unaccent dictionary will probably be good enough.

A

-- 
Andrew Sullivan
ajs@xxxxxxxxxxxxxxx

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general