Re: Searching for "bare" letters

Andrew Sullivan <ajs@xxxxxxxxxxxxxxx> · Mon, 3 Oct 2011 11:24:51 -0400

On Sun, Oct 02, 2011 at 05:45:48PM +0200, Reuven M. Lerner wrote:
> quite grateful for that.  (I really hadn't ever needed to deal with
> such issues in the past, having worked mostly with English and
> Hebrew, which don't have such accent marks.)

That isn't quite true about English.  We have words like coöperate and
naïve.  The former is sometimes fixed with a hyphen instead, but the
latter can't be.

I think what happened is that English speakers, because we're already
used to being sloppy (you can't tell what's a subjunctive in English,
either, just by looking), were willing to adapt our spelling to reflect
the limitations of typewriters.  Also, English never really had an
official standard spelling -- by the time the English were attempting
to standardize seriously, there was already an American branch with
its own Bossypants Official Reformer of Spelling ("BORS", which in
that case was Noah Webster.  See G.B. Shaw for a British example).  So
we mostly lost the accents in standard spelling.  We also lost various
standard digraphs, like that in encyclopædia (which, depending on
which branch of nonsense you subscribe to, can be spelled instead
"encyclopedia" or "encyclopaedia"; both would have been called "wrong"
once upon a time).

> As for the unaccent dictionary, I hadn't heard of it before, but
> just saw it now in contrib, and it looks like it might fit
> perfectly.  I'll take a look; thanks for the suggestion.

The big problem there is what someone else pointed to up-thread: in
some languages, the natural thing to do is to transliterate using
multiple characters.  The usual example is that in German it is common
to use "e" after a vowel to approximate the umlaut.  So, "ö" becomes
"oe".  Unfortunately, in Swedish this is clearly a mistake, and if you
can't use the diaeresis, then you just use the "undecorated" character
instead.  The famous Swedish ship called the Götheborg cannot be
transliterated as Goetheborg.  Even in German, the rule is
complicated, because it's not two-way: you can't spell the famous
writer's name Göthe (even though Google seems to think you can).
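
To make the asymmetry concrete, here is a rough Python sketch.  It is
not PostgreSQL code and not how the unaccent dictionary is implemented;
the names strip_accents, fold_german and GERMAN_RULES are made up for
illustration, and the rule table is only the handful of German cases
I'd expect to matter.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks: the 'undecorated character' approach."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical German multi-character rules (illustrative, not exhaustive).
GERMAN_RULES = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def fold_german(text):
    """Expand umlauts the German way, then strip whatever accents remain."""
    # Compose first so that 'o' + combining diaeresis matches the table.
    composed = unicodedata.normalize("NFC", text)
    expanded = "".join(GERMAN_RULES.get(ch, ch) for ch in composed)
    return strip_accents(expanded)

swedish_style = strip_accents("Götheborg")  # plain "o", no inserted "e"
german_style = fold_german("Köln")          # "ö" expands to "oe"
```

Note that fold_german maps both "Göthe" and "Goethe" to the same
output, so nothing here can be run backwards to recover the original
spelling.  That is the one-way problem above.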

As far as I can tell, the unaccent dictionary doesn't handle the
two-character case, though it sure looks like it could be extended to
do it.  But it doesn't seem to have a facility for differentiating
based on the language of the string.  I don't know whether that could
be added.

The upshot is that, if you need to store multilingual input and do
special handling on the strings afterwards, you are wise to store the
string with a language tag so that you can apply the right rules later
on.  See RFC 5646 (http://www.rfc-editor.org/rfc/rfc5646.txt) for some
pointers.  If just "stripping accents" is good enough for you, then
the unaccent dictionary will probably be good enough.
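
As a sketch of what "store the string with a language tag" might buy
you, something along these lines would let you pick the folding rules
per row.  Everything here (TaggedText, MULTI_CHAR_RULES, search_key) is
a hypothetical name of mine, and taking the part before the first
hyphen as the language is a simplification: real RFC 5646 tags have
private-use and grandfathered forms that this ignores.

```python
import unicodedata
from dataclasses import dataclass

# Hypothetical table: languages that transliterate with extra characters.
# Languages not listed fall back to plain accent stripping.
MULTI_CHAR_RULES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
}

@dataclass(frozen=True)
class TaggedText:
    text: str
    lang: str  # language tag stored alongside the text, e.g. "de-CH" or "sv"

def primary_language(tag):
    """Simplified: treat everything before the first '-' as the language."""
    return tag.split("-", 1)[0].lower()

def search_key(item):
    """Build a lowercase, accent-free key using the rules for item.lang."""
    rules = MULTI_CHAR_RULES.get(primary_language(item.lang), {})
    text = unicodedata.normalize("NFC", item.text).lower()
    expanded = "".join(rules.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFD", expanded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

ship = search_key(TaggedText("Götheborg", "sv"))  # Swedish: no "oe"
city = search_key(TaggedText("Zürich", "de-CH"))  # German: "ü" becomes "ue"
```

Without the stored tag there is no way to decide, after the fact,
which of the two treatments a given "ö" should get.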

A

-- 
Andrew Sullivan
ajs@xxxxxxxxxxxxxxx

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)