Hello,

I'm using geo_distance() from contrib/earthdistance and would like to find a way to speed up the distance calculation if possible. This is for a proximity search: "Show me adoptable pets within 250 miles of this zipcode."

I've researched a number of approaches to this, but none seem as workable as I would have hoped. I read a claim [1] that Jobster used a lookup table of pre-calculated distances between zipcodes... it took about 100 million rows.

1. http://bostonsteamer.livejournal.com/831325.html

I'd like to avoid that, but I think there's a sound concept in there: we are repeatedly making a complex calculation with the same inputs, and the outputs are always the same.

The zipdy project [2] used some interesting approaches, both variations on the large-table idea. One involved a PL routine that would look up the result in a cache table; if no result was found, it would compute the result and add it to the cache table (the first sketch in the P.S. below shows roughly what I mean). Besides eventually creating millions of rows in the cache table, I tried this technique and found it was much slower than using geo_distance() without a cache. The other variation in the zipdy distribution just uses several smaller cache tables, like one for "zipcodes 25 miles away" and one for "zipcodes 50 miles away". Equally unattractive.

2. http://www.cryptnet.net/fsp/zipdy/

I looked at doing the calculation outside of PostgreSQL and passing the resulting list of zipcodes in an explicit IN() list. This seemed promising at first: Geo::Postalcode (Perl) could do the calculation in 5 ms, which seemed to beat PostgreSQL. For a small radius, I think that combination might have performed better. However, some places have close to 5,000 zipcodes within 250 miles. I tried putting /that/ resulting list into an explicit IN() clause, and it got much slower. :)

I did find there are some possible optimizations that can be made to the haversine calculation itself. As this post pointed out [3], we could pre-convert each lat/lon pair to radians, and also pre-compute their sin() and cos() values (the second sketch in the P.S. shows my reading of the idea). However, the person suggesting this approach provided no benchmarks to show it was worth it, and I have no evidence so far that it matters either.

3. http://www.voxclandestina.com/2006-09-01/optimizing-spatial-proximity-searches-in-sql/

What strikes me to consider at this point are a couple of options:

A. Explore a way to add some in-memory caching or "memoizing" to geo_distance(), so it would hold on to frequently computed values without storing all the millions of possibilities.

B. Look at an alternate implementation. I suspect that given a small enough radius and the relatively large size of zipcodes, a simpler representation of the Earth's curvature could be used with sufficient accuracy: perhaps a cylinder, or even a flat projection. We currently max out at 250 miles. (I just discussed this option with my wife, the math teacher. :) The third sketch in the P.S. shows the crude version of B I have in mind.

Advice from other people who have deployed high-performance proximity searches with PostgreSQL would be appreciated!

Mark
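
P.S. Sketches of the three techniques mentioned above, in case the details help. In all of them, a zipcodes table with zipcode, lat, and lon columns is a simplified stand-in for my real schema, and the other names are made up too.

First, the zipdy-style cache, roughly as I tried it: probe a cache table, and on a miss compute the distance with geo_distance() and remember the answer.

    CREATE TABLE zip_distance_cache (
        zip_a char(5),
        zip_b char(5),
        miles float8,
        PRIMARY KEY (zip_a, zip_b)
    );

    CREATE OR REPLACE FUNCTION cached_geo_distance(a char(5), b char(5))
    RETURNS float8 AS $$
    DECLARE
        d float8;
    BEGIN
        -- Try the cache first.
        SELECT miles INTO d
          FROM zip_distance_cache
         WHERE zip_a = a AND zip_b = b;
        IF FOUND THEN
            RETURN d;
        END IF;

        -- Cache miss: compute it the normal way and remember the answer.
        -- (earthdistance's geo_distance() takes its points as
        -- (longitude, latitude) and returns statute miles.)
        SELECT geo_distance(point(z1.lon, z1.lat), point(z2.lon, z2.lat))
          INTO d
          FROM zipcodes z1, zipcodes z2
         WHERE z1.zipcode = a AND z2.zipcode = b;

        INSERT INTO zip_distance_cache (zip_a, zip_b, miles)
        VALUES (a, b, d);
        RETURN d;
    END;
    $$ LANGUAGE plpgsql;

My guess is that the per-call index probe (and the INSERT on every miss) is exactly why this loses to just calling geo_distance() directly.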
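
Second, my reading of the pre-computation idea from [3]. With sin() and cos() of each latitude cached in extra columns, the spherical law of cosines can consume the cached values directly (the haversine form would only reuse the cos() values), so the per-row work drops to one cos() and one acos():

    -- One-time precomputation per row (column names are mine):
    ALTER TABLE zipcodes
        ADD COLUMN lat_sin float8,
        ADD COLUMN lat_cos float8,
        ADD COLUMN lon_rad float8;

    UPDATE zipcodes SET
        lat_sin = sin(radians(lat)),
        lat_cos = cos(radians(lat)),
        lon_rad = radians(lon);

    -- Proximity search using only the cached values.  3959 is the
    -- Earth's mean radius in statute miles; least() clamps floating-
    -- point error so acos() never sees an argument just above 1.0.
    SELECT z.zipcode
      FROM zipcodes z, zipcodes origin
     WHERE origin.zipcode = '97201'   -- example search zipcode
       AND 3959 * acos(least(1.0,
               origin.lat_sin * z.lat_sin
             + origin.lat_cos * z.lat_cos
                 * cos(z.lon_rad - origin.lon_rad))) <= 250;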
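
Third, the crude version of option B: prefilter with a lat/lon bounding box, which an ordinary index on (lat, lon) can help with, and only run the exact distance check on the survivors. A degree of latitude is roughly 69 statute miles everywhere; a degree of longitude is 69 miles at the equator and shrinks by cos(latitude):

    -- :lat and :lon stand for the origin zipcode's coordinates.
    SELECT z.zipcode
      FROM zipcodes z
     WHERE z.lat BETWEEN :lat - 250.0 / 69.0
                     AND :lat + 250.0 / 69.0
       AND z.lon BETWEEN :lon - 250.0 / (69.0 * cos(radians(:lat)))
                     AND :lon + 250.0 / (69.0 * cos(radians(:lat)))
       -- Exact test only on the rows that survive the box.
       AND geo_distance(point(:lon, :lat), point(z.lon, z.lat)) <= 250;

To be strictly correct near the box's north and south edges, the longitude span should use the cos() of whichever box latitude is farthest from the equator, and none of this behaves near the poles or the 180th meridian; neither caveat matters for US zipcodes.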