Re: Need magic for identifying double addresses


On Thu, 16 Sep 2010 06:22:15 -0700, Andreas <maps.on@xxxxxxx> wrote:

It's not only typos that need catching. There are also variations in how things are written that are not necessarily wrong.
e.g.
Miller's Bakery
Bakery Miller
Bakery Miller, Ltd.
Bakery Miller and sons
Bakery Smith (formerly Miller)

and the usual
Strawberry Street
Strawberrystreet
Strawberry Str.42
Strawberry Str. 42
Strawberry Str. 42-45

If this is a one-time procedure, I'd definitely do it manually. The key is to quickly bind similar records together and find the "remaining" ones.

I'd create a lookup table and bind all similar values to a single value.
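As a rough sketch of that idea (in Python rather than SQL, just to show the shape of it): the lookup maps each raw spelling to the single value it is bound to, and anything not yet in the lookup is a "remaining" record. The particular bindings and the helper name split_bound are assumptions made up for illustration, using the street examples quoted above.

```python
# Raw values as they appear in the data (taken from the quoted examples).
raw_values = [
    "Strawberry Street",
    "Strawberrystreet",
    "Strawberry Str.42",
    "Strawberry Str. 42",
    "Strawberry Str. 42-45",
]

# Hypothetical lookup table: raw spelling -> the single value it is bound to.
# These bindings are illustrative choices, filled in by hand.
lookup = {
    "Strawberry Street": "Strawberry Street",
    "Strawberrystreet": "Strawberry Street",
    "Strawberry Str.42": "Strawberry Str. 42",
    "Strawberry Str. 42": "Strawberry Str. 42",
}


def split_bound(values, table):
    """Return (bound, remaining): values already in the table, and the rest."""
    bound = {v: table[v] for v in values if v in table}
    remaining = [v for v in values if v not in table]
    return bound, remaining


bound, remaining = split_bound(raw_values, lookup)
```

Here remaining holds the one value nobody has decided about yet ("Strawberry Str. 42-45"), which is the list you keep working down by hand.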

I would also take each word in the field, convert it to lower case, remove punctuation, and store it in another table (original_word varchar, normalized_word varchar). I would then look for the most frequent normalized_word values, hoping that would surface keywords like "strawberry" and "miller". I would then search for those keywords to continue building the lookup table.
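A minimal sketch of that word-splitting step, again in Python rather than SQL, assuming words are separated by whitespace and that "punctuation" means the ASCII punctuation characters. The function names are made up for illustration, and the sample values are the bakery examples quoted above.

```python
import string
from collections import Counter

# Sample field values (taken from the quoted examples).
values = [
    "Miller's Bakery",
    "Bakery Miller",
    "Bakery Miller, Ltd.",
    "Bakery Miller and sons",
    "Bakery Smith (formerly Miller)",
]

# Translation table that deletes every ASCII punctuation character.
_DROP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize_word(word):
    """Lower-case a word and remove all punctuation from it."""
    return word.lower().translate(_DROP_PUNCTUATION)


def word_pairs(value):
    """Yield (original_word, normalized_word) for each word in a value."""
    for original in value.split():
        normalized = normalize_word(original)
        if normalized:  # skip tokens that were punctuation only
            yield original, normalized


# Rows for the (original_word, normalized_word) table.
pairs = [pair for value in values for pair in word_pairs(value)]

# Frequency of each normalized word, most frequent first.
counts = Counter(normalized for _, normalized in pairs)
top_words = counts.most_common(2)
```

With these five values, "bakery" and "miller" come out on top. Note that "Miller's" normalizes to "millers" under this simple rule, so possessives and similar forms would still need to be bound by hand or with an extra rule.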

You might want to write an interface that lets you drag each of the DISTINCT keywords and drop it onto the "single" value it belongs to.

I have never seen such a tool, though. :)

Good luck.

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

