Well, if you want to be as certain as you can, you're going to need to
stick to a fairly rigorous check. What you could do is export the DB
into something PHP can quickly load right in (something serialized, for
example) every X minutes, and then at the end of the day run your
'rigorous test'. One thing that might spare you some work: once you've
checked a set against itself, you shouldn't need to check it again
[depends on how changes are handled]. If you track the records that are
changed, updated, or added, and every few hours (or nightly) check back
over just those, you'll have a fairly fast-performing system with few
likely dupes on the whole.
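Something like this rough sketch, say - the table and column names
(contacts, home_phone, work_phone, alt_phone, etc.) are just guesses at
your schema, so adjust to taste:

<?php
// cron side, every X minutes: dump the lookup keys to a serialized file
$db = new PDO('mysql:host=localhost;dbname=yourdb', 'user', 'pass');
$keys = array('phones' => array(), 'nameaddr' => array());
$res = $db->query("SELECT home_phone, work_phone, alt_phone,
                          CONCAT(last_name, street_address, zip) AS nak
                   FROM contacts");
foreach ($res as $row) {
    foreach (array('home_phone', 'work_phone', 'alt_phone') as $col) {
        if ($row[$col] != '') {
            $keys['phones'][$row[$col]] = true;
        }
    }
    $keys['nameaddr'][$row['nak']] = true;
}
file_put_contents('/tmp/contact_keys.ser', serialize($keys));

// entry side: load the file once, then each incoming record is an
// O(1) array lookup instead of a 1.5-2 second query
$keys = unserialize(file_get_contents('/tmp/contact_keys.ser'));
$maybe_dupe = isset($keys['phones'][$new['home_phone']])
    || isset($keys['phones'][$new['work_phone']])
    || isset($keys['phones'][$new['alt_phone']])
    || isset($keys['nameaddr'][$new['last_name']
              . $new['street_address'] . $new['zip']]);
?>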
Now, in my opinion you might want to stick with the idea of hitting the
DB directly (depends) - but toss in some 'real world sanity checking'.
Without seeing the query I can't say whether you already have - but
chances are you can do quicker passes frequently, and occasionally do
the slower, more thorough checks. Most people's last names don't change
often - and when they do, they could easily also change phone number and
address. Toss in SOUNDEX() to handle typos if doing it at the DB level,
or you can use levenshtein() to catch the occasional typo if hitting the
info in PHP.
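For instance (just a sketch - the function name is made up and the
thresholds are guesses):

<?php
// fuzzy last-name check using PHP's built-in soundex() and levenshtein()
function looks_like_dupe($new_last, $existing_last) {
    // soundex catches phonetic near-misses (Smith vs. Smyth)
    if (soundex($new_last) == soundex($existing_last)) {
        return true;
    }
    // levenshtein catches a character or two of typo (Jonson vs.
    // Johnson); it's case sensitive, so normalize first
    return levenshtein(strtolower($new_last),
                       strtolower($existing_last)) <= 2;
}
?>

At the DB level the MySQL equivalent is just
WHERE SOUNDEX(last_name) = SOUNDEX('Smith') in the query.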
Another thing to note - if you know you're going to have duplicates no
matter what your efforts, you might just want to automatically accept
new records and let the system mark anything that looks like a dupe for
a human to check over, saving you all that checking on the entry end of
things (which I guess is done in batches, PDA sync or some such?).
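Roughly like so - assuming you added some kind of needs_review flag
column, which isn't in your schema yet:

<?php
// accept the record unconditionally; just flag it if the quick check
// thought it looked suspicious
$db = new PDO('mysql:host=localhost;dbname=yourdb', 'user', 'pass');
$maybe_dupe = true; // result of whatever quick check ran first
$stmt = $db->prepare("INSERT INTO contacts
        (last_name, street_address, zip, home_phone, needs_review)
        VALUES (?, ?, ?, ?, ?)");
$stmt->execute(array($new['last_name'], $new['street_address'],
                     $new['zip'], $new['home_phone'],
                     $maybe_dupe ? 1 : 0));
// then a nightly job or a simple admin page works the queue:
//   SELECT * FROM contacts WHERE needs_review = 1
?>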
If you accept that there will be no fast, perfect solution, you can
start getting creative with a couple of different levels of checking
and find a pretty happy balance.
cheers,
--
- Martin Norland, Sys Admin / Database / Web Developer, International
Outreach x3257
The opinion(s) contained within this email do not necessarily represent
those of St. Jude Children's Research Hospital.
tg-php@xxxxxxxxxxxxxxxxxxxxxx wrote:
We have a table with contacts and we need to determine if someone's already in there but with possibly slightly different info or if the old data contains duplicates. Current criteria (I'm sure there's better):
If home phone matches any existing home, work or alt phones.
If work phone matches any existing home, work or alt phones.
If alt phone matches any existing home, work or alt phones.
If a concat of last name, street address and zip match any existing.
Table currently contains about 60,000 entries (imported from old system) and we can expect it to grow.
Indexes are on home, work and alt phones, as well as last name, street address and zip. Tried it without indexes and it was slow; tried it with indexes and it's still about the same speed.
Tried seeing if putting the dupe check into a prepared statement sped it up any. It didn't.
Analyzed table. Same speed.
Out of 60,000 entries, we have roughly 40,000+ unique phone numbers (that's doing a union on all three phone number columns). So even pre-compiling just that list and querying it in another table is going to run into similar time issues, it seems.
For new contacts being brought in, we might have 30-60 at a time (maybe more) so this query would end up being run on each of the 30-60. At 1.5-2 sec for each query, that ends up being a full minute or two to do the whole scan.
If I pull the phone #'s and name/address/zip combos into an array in PHP first, it's only like 12 secs, but each user would end up pulling it (and maybe storing it in a session variable or something) and re-pulling it periodically to make sure it's semi-fresh. This sounds like the quicker, but not the "best", solution.
Any tricks, tips, or philosophical and/or theoretical things I'm just not seeing - please let me know.