Re: Searching for "bare" letters

Eduardo Morras <nec556@xxxxxxxxxx> · Sun, 02 Oct 2011 23:03:18 +0200

At 01:25 02/10/2011, Reuven M. Lerner wrote:

Hi, everyone.Â  I'm working on a project on 
PostgreSQL 9.0 (soon to be upgraded to 9.1, 
given that we haven't yet launched).Â  The 
project will involve numerous text fields 
containing English, Spanish, and 
Portuguese.Â  Some of those text fields will be 
searchable by the user.Â  That's easy enough to 
do; for our purposes, I was planning to use some 
combination of LIKE searches; the database is 
small enough that this doesn't take very much 
time, and we don't expect the number of 
searchable records (or columns within those records) to be all that large.

The thing is, the people running the site want 
searches to work on what I'm calling (for lack 
of a better term) "bare" letters.Â  That is, if 
the user searches for "n", then the search 
should also match Spanish words containing 
"Ã±".Â  I'm told by Spanish-speaking members of 
the team that this is how they would expect 
searches to work.Â  However, when I just did a 
quick test using a UTF-8 encoded 9.0 database, I 
found that PostgreSQL didn'tÂ  see the two 
characters as identical.Â  (I must say, this is 
the behavior that I would have expected, had the 
Spanish-speaking team member not said anything on the subject.)

So my question is whether I can somehow wrangle 
PostgreSQL into thinking that "n" and "Ã±" are 
the same character for search purposes, or if I 
need to do something else -- use regexps, keep a 
"naked," searchable version of each column 
alongside the native one, or something else entirely -- to get this to work.

Any ideas?

You can use perceptual hashing for that. There 
are multiple algorithms, some of them can be tuned for specific languages.

See this documentation:

http://en.wikipedia.org/wiki/Phonetic_algorithm for a general description,

http://en.wikipedia.org/wiki/Soundex is the first one developed, very old,

http://en.wikipedia.org/wiki/Metaphone is a 
family of several modern algorithms.

Remember that they are hashing algorithms, some 
words can collide because they have the same pronunciation but write different.

I remember that datapark search engine uses them 
with dictionaries. You can check it too.

http://www.dataparksearch.org/

Thanks,

Reuven

HTH 

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general