On Thu, Sep 16, 2010 at 04:40:42AM +0200, Andreas wrote:
> I need to clean up a lot of contact data because of a merge of customer
> lists that used to be kept separate.
> I already know that there are double entries within the lists and they
> do overlap, too.
>
> Relevant fields could be name, street, zip, city, phone

GROUP BY is your friend here; you basically want to normalise things as
much as possible and then GROUP BY, counting the number of duplicates,
and wherever that count is greater than one you need to intervene
somehow.

Humans are great at typos (2% of records seem to contain a typo of some
sort in my experience, with almost all of them not mattering) so the
first thing would be to correct the typos (or use algorithms that are
less susceptible to typos) and to start getting things normalised.

Free-form text is a bit of a fiddle to normalise, but the
fuzzystrmatch[1] module in PG can help with this.  An example of a query
I run a lot is:

  SELECT soundex(city),
         array_agg(DISTINCT city)  -- array_accum for 8.3 and earlier
  FROM tbl
  GROUP BY 1
  HAVING COUNT(DISTINCT city) > 1
  ORDER BY 2 DESC;

Another common one is:

  SELECT t.*
  FROM tbl t, (
    SELECT soundex(city) AS city, soundex(name) AS name
    FROM tbl
    GROUP BY 1, 2
    HAVING COUNT(DISTINCT name) > 1) x
  WHERE soundex(t.city) = x.city
    AND soundex(t.name) = x.name;

I.e. find all the entries with similar-sounding cities and names where
the names have been spelled differently.  You can then check through and
correct the entries that really should be the same.

What to do depends on how much data you have; with a few thousand
records you can do lots of fiddling by hand, whereas with a few tens of
millions of people you want to do more of the work in code.

--
  Sam  http://samason.me.uk/

 [1] http://www.postgresql.org/docs/current/static/fuzzystrmatch.html
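
A minimal sketch of the plain GROUP BY duplicate check described at the
start of the mail, assuming the table is called tbl and has the fields
mentioned (name, street, zip, city, phone); adjust the column list to
whatever you actually normalise on:

  -- exact duplicates after normalisation; dups is the number of copies
  SELECT name, street, zip, city, phone, COUNT(*) AS dups
  FROM tbl
  GROUP BY name, street, zip, city, phone
  HAVING COUNT(*) > 1
  ORDER BY dups DESC;

Anything this returns is an exact duplicate after whatever normalisation
has been applied; near-misses (typos, alternative spellings) need the
soundex-style queries above.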