On 9/10/05, Ben <bench@xxxxxxxxxxxxxxx> wrote:
> Hrm, I must be missing something, because I don't see how this will
> transliterate to ASCII?

If you want non-western text to be Romanized you can take a look at
Text::Unidecode(1).  The functionality in the chunk of perl I sent
before was stripping nonspacing marks (accents, rings, umlauts and
such).  You may need to strip other character classes if you've got
unicode punctuation codepoints in the text to be searched.

For the example you gave, the process is to decompose the character
"å" to normalization form D, which yields "a" plus the unicode
nonspacing mark for the ring, and then remove the nonspacing mark (the
ring diacritic) with the regex s/\pM//sog.  That will leave just the
ASCII "a" in the text, and the text can then be treated as pure ASCII,
because it no longer contains any unicode codepoints with an ord()
above 127.  You may want to look here(2) for an explanation and
examples of Unicode normalization forms.

If you don't need that much functionality (handling arbitrary unicode
text), and you're dealing strictly with the Latin1 subset of unicode,
you can just create a mapping table or hash to transliterate down to
ASCII, as done here(3).

1) http://cpan.uwinnipeg.ca/htdocs/Text-Unidecode/Text/Unidecode.html
2) http://www.unicode.org/unicode/reports/tr15/#Canonical_Composition_Examples
3) http://www.eprints.org/files/eprints2/eprints-2.2/defaultcfg/ArchiveTextIndexingConfig.pm

>
> On Sep 10, 2005, at 5:30 AM, Mike Rylander wrote:
>
> > On 9/9/05, Ben <bench@xxxxxxxxxxxxxxx> wrote:
> >
> >> I'm working on a problem that I imagine others have had, which
> >> basically boils down to having nice unicode display text that users
> >> are going to want to search against without typing it correctly....
> >> e.g. let a search for "sma" match "små".
> >> It seems like the best way to do this is to find a magic unicode
> >> transliteration mapping function, and then save the ASCII
> >> transliterations for searching against.
> >
> > The simplest solution to this that I've found is to maintain a
> > separate column for ASCII-ized version of your text.  The conversion
> > can be done automatically using a trigger, and I have one in PL/PERLU
> > that I use.  It basically boils down to:
> >
> > 1) transform unicode text to normal form D
> > 2) strip combining non-spacing marks
> >
> > In modern Perls that looks like:
> >
> > #--------------
> > use Unicode::Normalize;
> > my $txt = NFD(shift());
> > $txt =~ s/\pM//og;
> > return $txt;
> > #--------------
> >
> > Hope that helps!

--
Mike Rylander
mrylander@xxxxxxxxx
GPLS -- PINES Development
Database Developer
http://open-ils.org
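
A minimal standalone sketch of the two steps quoted above (decompose to
normalization form D, then strip combining marks), for trying the idea
outside a trigger.  The sub name strip_marks and the sample strings are
illustrative assumptions, not part of the original PL/PerlU function.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper name; the quoted snippet only shows the function body.
sub strip_marks {
    my ($txt) = @_;
    $txt = NFD($txt);       # decompose: U+00E5 becomes "a" + U+030A (combining ring above)
    $txt =~ s/\p{M}//g;     # \p{M} matches every mark category (Mn, Mc, Me)
    return $txt;
}

binmode STDOUT, ':encoding(UTF-8)';

# "små" written with an escape so the example does not depend on source encoding.
print strip_marks("sm\N{U+00E5}"), "\n";   # prints "sma"

# U+00F8 (o with stroke) has no canonical decomposition, so it passes
# through unchanged; characters like this need a mapping table instead.
print strip_marks("\N{U+00F8}"), "\n";
```

The second call shows why the mapping-table approach is still useful for
some Latin1 characters: stripping marks only helps when the character
actually decomposes into a base letter plus a combining mark.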