Re: application of KNN code to US zipcode searches?

Mark Stosberg <mark@xxxxxxxxxxxxxxx> · Thu, 17 Feb 2011 10:20:58 -0500

> I thought the benefit of KNN was that you could retrieve the rows in
> distance order, so that a query for the closest 20 locations (for
> example) would be very fast.  I wouldn't have expected it to be
> helpful when you're selecting all the rows regardless of distance.

Kevin,

Thanks for the feedback. You are right that my "reduced test case"
wasn't a good approximation. I added a limit, to simulate finding the
100 zipcodes closest to 90210.

Below I compare 4 approaches to the same query:

1. Cube search
2. Earth Distance Search
3. Simple point distance (no index)
4. Simple point distance (KNN)

Now KNN benchmarks to be almost 100x faster! That's very promising.
Then there's only the issue that simple point distance is not expected
to be a good enough approximation of earth-distances. Perhaps that can
be solved by pre-computing coordinates based on the lat/long pairs....
much like the map projections used to present a curved surface on a flat
map? Given that's OK to be be a few miles off, it seems we have some
leeway here.

Recommendations?

    Mark

EXPLAIN ANALYZE
SELECT zipcode,
    cube_distance( '(-2513120.64361786, -4645511.0460328,
3575538.9507084)', zipcodes.earth_coords)/1609.344 AS radius
    FROM zipcodes ORDER BY radius LIMIT 100;

---------------------------------------------------------------
 Limit  (cost=2946.70..2946.95 rows=100 width=62) (actual
time=167.650..168.064 rows=100 loops=1)
   ->  Sort  (cost=2946.70..3050.40 rows=41483 width=62) (actual
time=167.644..167.829 rows=100 loops=1)
         Sort Key: ((cube_distance('(-2513120.64361786,
-4645511.0460328, 3575538.9507084)'::cube, earth_coords) /
1609.344::double precision))
         Sort Method: top-N heapsort  Memory: 20kB
         ->  Seq Scan on zipcodes  (cost=0.00..1361.24 rows=41483
width=62) (actual time=0.030..90.807 rows=41483 loops=1)
 Total runtime: 168.300 ms

############################################################3

-- Using Earthdistance
EXPLAIN ANALYZE SELECT zipcode,
    lon_lat <@> '(-118.412426,34.096629)' As radius
    FROM zipcodes
    ORDER BY lon_lat <@> '(-118.412426,34.096629)'
    LIMIT 100;

------------------------------------------------------------
 Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual
time=187.995..188.451 rows=100 loops=1)
   ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual
time=187.989..188.149 rows=100 loops=1)
         Sort Key: ((lon_lat <@> '(-118.412426,34.096629)'::point))
         Sort Method: top-N heapsort  Memory: 20kB
         ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483
width=22) (actual time=0.033..108.203 rows=41483 loops=1)
 Total runtime: 188.660 ms

##########################################

Using simple point distance, but with no Gist Index:

EXPLAIN ANALYZE SELECT zipcode,
    lon_lat <-> '(-118.412426,34.096629)' As radius
    FROM zipcodes
    ORDER BY lon_lat <-> '(-118.412426,34.096629)'
    LIMIT 100;

--------------------------------------------------------
 Limit  (cost=2842.99..2843.24 rows=100 width=22) (actual
time=160.574..161.057 rows=100 loops=1)
   ->  Sort  (cost=2842.99..2946.70 rows=41483 width=22) (actual
time=160.568..160.691 rows=100 loops=1)
         Sort Key: ((lon_lat <-> '(-118.412426,34.096629)'::point))
         Sort Method: top-N heapsort  Memory: 20kB
         ->  Seq Scan on zipcodes  (cost=0.00..1257.54 rows=41483
width=22) (actual time=0.027..84.610 rows=41483 loops=1)
 Total runtime: 161.226 ms

#########################################

-- Using KNN-GIST index
EXPLAIN ANALYZE SELECT zipcode,
    lon_lat <-> '(-118.412426,34.096629)' As radius
    FROM zipcodes
    ORDER BY lon_lat <-> '(-118.412426,34.096629)'
    LIMIT 100;
------------------------------------------------------------------
 Limit  (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892
rows=100 loops=1)
   ->  Index Scan using zipcodes_knn on zipcodes  (cost=0.00..5365.93
rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1)
         Order By: (lon_lat <-> '(-118.412426,34.096629)'::point)
 Total runtime: 2.198 ms

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance