Re: application of KNN code to US zipcode searches?

Mark Stosberg <mark@xxxxxxxxxxxxxxx> · Sat, 29 Oct 2011 10:45:22 -0400

Hello,

I want to report that I have now solved the challenges I ran into using
KNN for US zipcode searching. I've found the new approach to not only be
viable, but to benchmark about 3x faster for our own real-world
application than the previous approach we used, involving
cube_distance() and earth_box().

Here's some details about my research so far.

To evaluate it, I installed PostgreSQL 9.1 and a current PostGIS 2.0
snapshot (not yet released as stable).

A primary challenge I had to solve was that KNN is designed for a
slightly different problem than what I needed to solve. I need to answer
the question:

 "What are all the objects that are in zipcodes with 50 miles of a given
zipcode?"

However, KNN only directly provides a performance boost to this
variation:

 "What are the N nearest objects to this point?"

Just adding a "WHERE clause" to check the 50 mile rule would erase the
benefits of KNN, which works through an "ORDER BY" clause.

I solved my issue by using a "WITH" clause that creates a pseudo-table
called "nearby_zipcodes". In this example, I select all the zipcodes
that are within 50 miles of the "47374" zipcode. The trick I've
employed is that I've set the LIMIT value to 286-- exactly the number of
zipcodes within 50 miles of 47374. My plan is to add another column to
my "zipcodes" table for each of the small number distances I need to
search. Then, when I load new zipcodes I can pre-compute how many
zipcodes would be found at this distance.

This have approach would not have worked without a "WITH" clause, or
some equivalent, because the number of objects within the radius is not
known, but the number of nearby zipcodes is fixed.

This approach allows me to get the performance benefits of KNN, while
also returning exactly those objects within 50 miles of my
target zipcode, by JOINing on the "nearby_zipcodes" table:

 WITH nearby_zipcodes AS (
     SELECT zipcode,
         st_distance_sphere(lonlat_point, (SELECT lonlat_point from
zipcodes WHERE zipcode = '47374')) / 1609.344 as radius
     FROM zipcodes
     ORDER BY lonlat_point  <-> (SELECT lonlat_point from zipcodes WHERE
zipcode = '47374')
     LIMIT 286
 )
 SELECT ...

You might also notice that "st_distance_sphere()" doesn't mean exactly
the same thing as the "<->" operator. That's something I could refine
going forward.

That's what I've got so far. How could I improve this further?

For reference, here are the key parts of the "zipcodes" table:

# \d zipcodes
             Table "public.zipcodes"
    Column    |         Type          | Modifiers
--------------+-----------------------+-----------
 zipcode      | character varying(5)  | not null
 lonlat_point | geometry(Point,4326)  |
Indexes:
    "zipcodes_pkey" PRIMARY KEY, btree (zipcode)
    "lonlat_point_idx" gist (lonlat_point)

Thanks for the peer review!

        Mark Stosberg

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance