Steve Atkins <steve <at> blighty.com> writes:
> On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:
>
> > Hi postgressers -
> >
> > As part of my work with voter file data, I pretty regularly have to
> > join one large-ish (over 500k rows) table to another. Sometimes this
> > is via a text field (countyname) + integer (voter id). I've noticed
> > sometimes this converges and sometimes it doesn't, seemingly
> > regardless of how I index things.

By "converge" you mean "finish running" -- "converge" has a lot of other overtones for us amateur math types.

Note that I think you are doing "record linkage", which is an academic stepchild of its own these days. It might bear some research. There is also a CDC matching program for text files, freely downloadable for Windows (ack), if you hunt for it.

For now, my first thought is that you should try a few different matches, maybe via PL/pgSQL functions, cascading the non-hits to the next step in the process while shrinking your tables. Upper-case and delete all spaces, etc. First use equality on all columns, which should be able to use indices, and separate those records. Then try equality on a few columns. Then try some super fuzzy regexes on a few columns. Etc.

You will also have to give some thought to scoring a match, with perfection a one, but, say, name and birthday the same with all else different a .75, etc.

Also, soundex(), levenshtein(), and other fuzzy string tools are your friend. I want to write a version of SAS's COMPGED for Postgres, but I haven't got round to it yet.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
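
To make the cascading idea in the reply concrete, here is a rough sketch in plain Python rather than PL/pgSQL, so it can be run without a database. It is only an illustration, not a tuned implementation. The column names (county, voter_id, name, birthday), the 0.5 score for a fuzzy hit, and the edit-distance threshold of 2 are all assumptions made up for the example; only the 1 and .75 scores come from the message. The third pass uses edit distance on the name where the message suggests fuzzy regexes, and the small levenshtein() helper is a hand-written stand-in, not the PostgreSQL function.

```python
import re

# Hypothetical column names; substitute whatever the voter tables actually use.
ALL_FIELDS = ("county", "voter_id", "name", "birthday")


def normalize(value):
    """Upper-case and delete all whitespace before comparing."""
    return re.sub(r"\s+", "", str(value)).upper()


def levenshtein(a, b):
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def key(rec, fields):
    return tuple(normalize(rec[f]) for f in fields)


def exact_pass(left, right, fields, score):
    """Equality match on `fields` via a dict lookup (the analogue of an index).

    Returns (matches, unmatched_left, unmatched_right) so that the next
    pass only sees the non-hits.
    """
    index = {}
    for r in right:
        index.setdefault(key(r, fields), []).append(r)
    matches, left_rest = [], []
    for l in left:
        bucket = index.get(key(l, fields))
        if bucket:
            matches.append((l, bucket.pop(), score))
        else:
            left_rest.append(l)
    right_rest = [r for bucket in index.values() for r in bucket]
    return matches, left_rest, right_rest


def fuzzy_pass(left, right, max_distance, score):
    """Same birthday, name within `max_distance` edits. Quadratic, so it
    should only ever run on the leftovers from the exact passes."""
    right = list(right)
    matches, left_rest = [], []
    for l in left:
        hit = next((r for r in right
                    if normalize(r["birthday"]) == normalize(l["birthday"])
                    and levenshtein(normalize(l["name"]),
                                    normalize(r["name"])) <= max_distance),
                   None)
        if hit is None:
            left_rest.append(l)
        else:
            right.remove(hit)
            matches.append((l, hit, score))
    return matches, left_rest, right


def cascade_match(left, right):
    """Run the passes in order, strictest first, shrinking both sides."""
    found = []
    m, left, right = exact_pass(left, right, ALL_FIELDS, 1.0)
    found += m
    m, left, right = exact_pass(left, right, ("name", "birthday"), 0.75)
    found += m
    m, left, right = fuzzy_pass(left, right, max_distance=2, score=0.5)
    found += m
    return found, left, right
```

cascade_match(a, b) takes two lists of dicts and returns the scored pairs plus whatever is left unmatched on each side. In the database the two exact passes would be ordinary indexed equality joins, and only the last pass would need the fuzzy string functions.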