Re: to_ascii, or some other form of magic transliteration

Mike Rylander <mrylander@xxxxxxxxx> · Sun, 11 Sep 2005 14:57:51 +0000

On 9/10/05, Ben <bench@xxxxxxxxxxxxxxx> wrote:
> Hrm, I must be missing something, because I don't see how this will
> transliterate to ASCII?

If you want non-Western text to be Romanized, take a look at
Text::Unidecode(1).  The chunk of Perl I sent before only strips
non-spacing marks (accents, rings, umlauts and such).  You may need
to strip other character classes as well if you've got Unicode
punctuation codepoints in the text to be searched.

For the example you gave, the process is to decompose the character
"å" to normalization form D, which yields "a" followed by the Unicode
non-spacing mark for the ring, and then to remove the non-spacing mark
(the ring diacritic) with the regex s/\pM//sog.  That leaves just the
ASCII "a" in the text, and the text can then be treated as pure ASCII,
because it no longer contains any Unicode codepoints with an ord()
above 127.  You may want to look here(2) for an explanation and
examples of Unicode normalization forms.
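
A minimal sketch of those two steps, for illustration only.  The helper
name strip_marks is made up for this example and is not taken from the
trigger I mentioned; Unicode::Normalize ships with Perl 5.8 and later.
The input uses the escape "\x{e5}" for "å" so the snippet does not
depend on the encoding of the source file.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose, then drop combining marks.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);   # "\x{e5}" becomes "a" . "\x{30a}"
    $decomposed =~ s/\pM//g;       # remove every character in category M
    return $decomposed;
}

# "sm\x{e5}" is "sma" with a ring above the final letter.
print strip_marks("sm\x{e5}"), "\n";
```

Note that this only helps for characters that have a canonical
decomposition.  A letter such as "ø" (U+00F8) has none, so it passes
through unchanged and needs a mapping table or Text::Unidecode instead.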

If you don't need that much functionality (handling arbitrary Unicode
text), and you're dealing strictly with the Latin-1 subset of Unicode,
you can just create a mapping table or hash to transliterate down to
ASCII, as done here(3).
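
A rough sketch of the hash approach, assuming the text is already a
decoded Perl character string.  The table below is deliberately tiny
and the names %ascii_for and latin1_to_ascii are invented for this
example; it is not the table from the file linked at (3), and a real
one would cover the whole 0x80-0xFF range.

```perl
use strict;
use warnings;

# Illustrative, incomplete mapping from Latin-1 characters to ASCII.
my %ascii_for = (
    "\x{e5}" => "a",    # a with ring above
    "\x{e4}" => "a",    # a with diaeresis
    "\x{f6}" => "o",    # o with diaeresis
    "\x{f8}" => "o",    # o with stroke
    "\x{e9}" => "e",    # e with acute
    "\x{e6}" => "ae",   # ae ligature
    "\x{df}" => "ss",   # sharp s
);

# Hypothetical helper: replace each mapped character in the
# 0x80-0xFF range, leaving unmapped characters as they are.
sub latin1_to_ascii {
    my ($text) = @_;
    $text =~ s/([\x{80}-\x{ff}])/exists $ascii_for{$1} ? $ascii_for{$1} : $1/ge;
    return $text;
}

print latin1_to_ascii("sm\x{e5}"), "\n";
```

Unlike the decomposition approach, the table can expand one character
into several ("ß" to "ss") and can handle letters that do not
decompose, at the cost of maintaining the table by hand.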

1) http://cpan.uwinnipeg.ca/htdocs/Text-Unidecode/Text/Unidecode.html 
2) http://www.unicode.org/unicode/reports/tr15/#Canonical_Composition_Examples
3) http://www.eprints.org/files/eprints2/eprints-2.2/defaultcfg/ArchiveTextIndexingConfig.pm

> 
> On Sep 10, 2005, at 5:30 AM, Mike Rylander wrote:
> 
> > On 9/9/05, Ben <bench@xxxxxxxxxxxxxxx> wrote:
> >
> >> I'm working on a problem that I imagine others have had, which
> >> basically boils down to having nice unicode display text that
> >> users are going to want to search against without typing it
> >> correctly.... e.g. let a search for "sma" match "små". It seems
> >> like the best way to do this is to find a magic unicode
> >> transliteration mapping function, and then save the ASCII
> >> transliterations for searching against.
> >>
> >>
> >
> > The simplest solution to this that I've found is to maintain a
> > separate column for ASCII-ized version of your text.  The conversion
> > can be done automatically using a trigger, and I have one in PL/PERLU
> > that I use.  It basically boils down to:
> >
> > 1) transform unicode text to normal form D
> > 2) strip combining non-spacing marks
> >
> > In modern Perls that looks like:
> >
> > #--------------
> > use Unicode::Normalize;
> > my $txt = NFD(shift());
> > $txt =~ s/\pM//og;
> > return $txt;
> > #--------------
> >
> > Hope that helps!
> >
> >
> 

-- 
Mike Rylander
mrylander@xxxxxxxxx
GPLS -- PINES Development
Database Developer
http://open-ils.org
