On Thu, Sep 16, 2010 at 03:22:15PM +0200, Andreas wrote: > We are talking about nearly 500.000 records with considerable overlapping. Other things to consider is whether each one contains unique entries and hence can you do a "best match" between datasets--FULL OUTER JOIN is your friend here, but duplicates become a problem. > It's not only typos to catch. There is variation in the way to write > things that not necessarily are wrong. > e.g. > Miller's Bakery > Bakery Miller > Bakery Miller, Ltd. > Bakery Miller and sons > Bakery Smith (formerly Miller) Soundex is tolerant to quite a lot of this, but word order is important. When I've had to do this before ~360k merging with ~80k addresses I've gone with normalised postcodes (in the UK postcodes contain a nice mix of letters and numbers meaning that I can be reasonable sure about typos) and then gone through a reasonable chunk by hand to make sure things are working "correctly". Just thought; depending on your spacial sparsity, you may be able to get away with trusting the zip code and checking when the soundex of the name is different. > and the usual > Strawberry Street > Strawberrystreet > Strawberry Str.42 > Strawberry Str. 42 > Strawberry Str. 42-45 Soundex gets those all the same (and even '42-45 Strawberry Str'), so that's easy. In fact it completely ignores the numbers so you'll have to do something specific about them. -- Sam http://samason.me.uk/ -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general