On Wed, 2009-05-27 at 19:53 -0400, Douglas Alan wrote:
> We have a very large table (150 million rows) where the rows are not
> in a random order. Some common queries will have millions of results,
> and this of course is slow. For an interactive interface to the
> database, we want to put a limit on all queries so that queries will
> return quickly even if there are millions of results.
>
> The problem we are seeing at the moment is that the statistics
> histograms are actually often hurting us. Postgres sees that for some
> queries there are so many results that it decides to do a sequential
> scan, rather than using the index. Unfortunately, sometimes all of
> these millions of results are at the end of the table, rather than
> being randomly distributed in the table, so a sequential scan is the
> worst possible approach.

If you're putting a LIMIT on it, why does it return millions of results?

> To fix this, we could use an "order by" clause to force Postgres to
> use the index, but this isn't so easy, as we are using Postgres
> through an ORM (i.e., Django in this case), and it's not so easy to
> try to jury-rig things this way on a per-query basis.

ORDER BY won't always choose an index scan, for instance if the
correlation is low and the expected results are many.

Can you pick out an interesting query and give some specifics, like:

* the query
* the EXPLAIN ANALYZE output (or EXPLAIN without ANALYZE if it takes
  too long to even run once)
* the EXPLAIN ANALYZE output if you force the index scan
* the statistics for the relevant columns, such as the histogram and
  correlation

Regards,
    Jeff Davis
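(To illustrate the plan comparisons Jeff asks for above: queries along
these lines would produce them. The table and column names are
hypothetical stand-ins, not from the original thread.)

    -- The plan actually chosen:
    EXPLAIN ANALYZE
    SELECT * FROM big_table WHERE category_id = 42 LIMIT 100;

    -- Force the index scan for comparison by disabling sequential
    -- scans for the current session only:
    SET enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT * FROM big_table WHERE category_id = 42 LIMIT 100;
    RESET enable_seqscan;

    -- The ORDER BY + LIMIT variant Douglas mentions, which can nudge
    -- the planner toward an index scan on the ordered column:
    EXPLAIN ANALYZE
    SELECT * FROM big_table
    WHERE category_id = 42
    ORDER BY category_id
    LIMIT 100;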
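(And for the column statistics: the histogram and correlation the
planner uses are visible in the pg_stats system view. Again, the names
below are illustrative.)

    -- Per-column planner statistics, including the value histogram
    -- and the correlation between value order and physical row order:
    SELECT attname, n_distinct, correlation,
           most_common_vals, most_common_freqs, histogram_bounds
    FROM pg_stats
    WHERE tablename = 'big_table'
      AND attname = 'category_id';

A correlation near 1.0 or -1.0 means the column's values track the
physical row order, which makes index scans cheap; a correlation near
0 means matching rows are scattered across the table, which is exactly
the situation where the planner's seqscan-vs-index decision matters.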