On Tue, Nov 29, 2016 at 6:56 PM, rob stone <floriparob@xxxxxxxxx> wrote: > Hello Michael, > On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote: >> Greetings, >> >> I have two tables that are populated using large datasets from >> disparate external systems, and I am trying to match records by >> customer name between these two tables. I do not have any >> authoritative key, such as customerID or nationalID, by which I can >> match them up, and I have found many cases where the same customer >> has different first names in the two datasets. A sampling of the >> differences is as follows: >> >> Michael <=> Mike >> Tom <=> Thomas >> Liz <=> Elizabeth >> Margaret <=> Maggie >> >> How can I build a query in PostgreSQL (v. 9.6) that will find >> possible matches like these on nicknames? My initial guess is that I >> would have to either find or build some sort of intermediary table >> that contains associated names like those above. Sometimes though, >> there will be more than matching pairs, like: >> >> Jim <=> James <=> Jimmy <=> Jimmie >> Bill <=> Will <=> Willie <=> William >> >> and so forth. >> >> Has anyone used or developed PostgreSQL queries that will find >> matches like these? I am running all my database queries. on my local >> laptops (Win7 and macOS), so performance or uptime is no issue here. >> I am curious to see how others in this community have creatively >> solved this common problem. >> >> One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might >> work here, but honestly I am clueless as to how to set this up or use >> it in queries successfully. >> >> Thanks, >> Michael (aka Mike, aka Mikey) >> > > Check out chapter F15 in the doco. > Try the double metaphone. > I worked on something similar many years ago cleaning up input created > by data entry clerks from hand written speeding tickets, so as to match > with "trusted" data held in a database. > As the volume of input was small in comparison with the number of > licensed drivers, we could iterate over and over again trying to match > it up. Also check out pg_trgm extension. It's better for addresses than names, but might be something to look at depending on how things turn up with the data. merlin -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general