On Tue, 2007-04-24 at 16:26, zardozrocks wrote:
> I have this table:
>
> CREATE TABLE test_zip_assoc (
>     id serial NOT NULL,
>     f_id integer DEFAULT 0 NOT NULL,
>     lat_radians numeric(6,5) DEFAULT 0.00000 NOT NULL,
>     long_radians numeric(6,5) DEFAULT 0.00000 NOT NULL
> );

Like someone else mentioned, numeric types are SLOW.  See if you can use
integers, or at least floats.

I also wonder if you might be better served with geometric types and GiST
indexes on them than with your lat / long grid.  With a geometric type you
could define the lat / long as a point and use geometric operations on it.

See the pgsql manual:
http://www.postgresql.org/docs/8.1/static/datatype-geometric.html
http://www.postgresql.org/docs/8.1/static/functions-geometry.html

> It's basically a table that associates some foreign_key (for an event,
> for instance) with a particular location using longitude and latitude.
> I'm basically doing a simple proximity search.  I have populated the
> database with *10 million* records.  I then test performance by picking
> 50 zip codes at random and finding the records within 50 miles with a
> query like this:

I assume that there aren't 10 million zip codes, right?  Are you storing
the lat / long of the individual venues, or of the zip codes?  If you're
storing the lat / long of the zips, I can't imagine there are 10 million
zip codes.  If you could use the lat / long numbers to find the zip codes
that are in your range, then join that to a venue table that fks off of
the zip code table, I would think it would be much faster, as you'd have a
smaller data set to trundle through.

> SELECT id
> FROM test_zip_assoc
> WHERE
>     lat_radians > 0.69014816041
>     AND lat_radians < 0.71538026567
>     AND long_radians > -1.35446228028
>     AND long_radians < -1.32923017502
>
> On my development server (dual proc/dual core Opteron 2.8 Ghz with 4GB
> ram) this query averages 1.5 seconds each time it runs after a brief
> warmup period.  In PostGreSQL it averages about 15 seconds.

I wonder how well it would run with 10, 20, 30, 40, etc. users running it
at the same time.  My guess is that you'll be very lucky to get anything
close to linear scaling in any database.  That's because this query is
CPU / memory-bandwidth intensive, so it's gonna kill your DB.  OTOH, if it
were I/O bound you could throw more hardware at it (battery-backed cache
RAID controller, etc.).

> Both of those times are too slow.  I need the query to run in under a
> second with as many as a billion records.  I don't know if this is
> possible but I'm really hoping someone can help me restructure my
> indexes (multicolumn?, multiple indexes with a 'where' clause?) so that
> I can get this running as fast as possible.

You're trying to do a whole lot of processing in a little time.  You're
either gonna have to accept a less exact answer (i.e. base it on zip
codes) or come up with some way of mining the data for the answers ahead
of time, kind of like a full text search for lat and long.

So, have you tried what I suggested about increasing shared_buffers and
work_mem yet?
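On the numeric vs. float point, the column change itself is easy to try,
something like this (untested sketch; double precision comparisons are a
lot cheaper than numeric(6,5) ones):

    ALTER TABLE test_zip_assoc
        ALTER COLUMN lat_radians  TYPE double precision,
        ALTER COLUMN long_radians TYPE double precision;

Note that will rewrite the whole 10 million row table, so run it off-hours.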
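And here's a rough, completely untested sketch of the geometric approach I
mean (names are made up; recent releases can GiST-index a point column
directly, but on 8.1 you may need a functional box() index or the contrib
cube / earthdistance modules instead):

    CREATE TABLE test_zip_assoc (
        id       serial PRIMARY KEY,
        f_id     integer NOT NULL DEFAULT 0,
        location point   NOT NULL          -- (long_radians, lat_radians)
    );

    CREATE INDEX test_zip_assoc_location_idx
        ON test_zip_assoc USING gist (location);

    -- same bounding box as your query; older releases spell containment @ not <@
    SELECT id
      FROM test_zip_assoc
     WHERE location <@ box '((-1.35446228028,0.69014816041),(-1.32923017502,0.71538026567))';

The win is that the bounding box becomes one two-dimensional index probe
instead of intersecting two big one-dimensional ranges.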
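If you go the zip code route instead, the query shape would be something
like this (table and column names are invented):

    SELECT e.id
      FROM event e
      JOIN zip   z ON z.zip_code = e.zip_code
     WHERE z.lat_radians  BETWEEN 0.69014816041  AND 0.71538026567
       AND z.long_radians BETWEEN -1.35446228028 AND -1.32923017502;

The range scan then only has to chew through the zip table instead of 10
million (eventually a billion) venue rows.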
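On the config side, this is roughly the sort of thing I had in mind.  The
numbers are illustrative starting points for a 4GB box, not tuned
recommendations; 8.1 takes raw units (shared_buffers in 8kB pages, work_mem
in kB), and shared_buffers needs a restart while work_mem doesn't:

    shared_buffers = 50000        # ~400MB
    work_mem = 16384              # 16MB per sort/hash step, so mind concurrent users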