Dan Ancona <da <at> vizbang.com> writes:

> [T]his is a choice between developing some in-house capacity for this
> and sending people to various vendors, so we'll probably lean on the
> vendors for now, at least while we work on it.

I would try to do the record matching in house and see how far you get, even if you are talking to vendors concurrently. You might get lucky, and you will learn a lot about your data and about how much to expect from, and pay for, vendor solutions.

I would:

1. Build multi-column indices on both tables over the columns you think identify the same rows, and match deterministically. (If you have a key such as a Social Security number, match on that first, then do this again on full names.)

2. Examine your data to see which rows hit, which miss, and which hit multiple times. If you know which table is "good" and which is "iffy", you can use a left outer join; otherwise you need a full outer join.

3. Put all the leftovers from each table into new tables, and try again with something fuzzy.

If you build the indices, match with "=", and it is still slow, ask again here -- that shouldn't happen.

> And you're right fork, Record Linkage is in fact an entire academic
> discipline!

Indeed. Look into "blocking" and "editing" your data first, I think.

I find this problem pretty interesting, so I would love to hear your results. I am right now matching building permits to assessor parcels... I wish I were using PG...

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
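P.S. A minimal sketch of the deterministic pass, assuming two hypothetical tables `voters_a` and `voters_b` with `id`, `first_name`, and `last_name` columns (all names here are illustrative, not from your schema):

```sql
-- Multi-column indices on the join columns of both tables.
CREATE INDEX a_name_idx ON voters_a (last_name, first_name);
CREATE INDEX b_name_idx ON voters_b (last_name, first_name);

-- Full outer join: one result set showing rows matched in both,
-- rows only in A (b_id IS NULL), and rows only in B (a_id IS NULL).
SELECT a.id AS a_id, b.id AS b_id
FROM voters_a a
FULL OUTER JOIN voters_b b
  ON  a.last_name  = b.last_name
  AND a.first_name = b.first_name;

-- The "hits multiple" case: rows in A matching more than one row in B.
SELECT a.id, count(*) AS n_matches
FROM voters_a a
JOIN voters_b b
  ON  a.last_name  = b.last_name
  AND a.first_name = b.first_name
GROUP BY a.id
HAVING count(*) > 1;
```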
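P.P.S. For the fuzzy pass on the leftovers, one option (not the only one) is the pg_trgm contrib module, which does trigram similarity and can use a GIN index. Again, table and column names (`leftover_a`, `leftover_b`, `full_name`) are illustrative:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A trigram GIN index lets the % (similar-to) operator use an index scan
-- instead of comparing every pair of rows.
CREATE INDEX leftover_b_trgm
  ON leftover_b USING gin (full_name gin_trgm_ops);

-- Candidate fuzzy matches, best first; tune pg_trgm.similarity_threshold
-- to control how loose the % operator is.
SELECT a.id AS a_id, b.id AS b_id,
       similarity(a.full_name, b.full_name) AS sim
FROM leftover_a a
JOIN leftover_b b ON a.full_name % b.full_name
ORDER BY sim DESC;
```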