Re: Bad query plan with high-cardinality column

Tom Lane <tgl@xxxxxxxxxxxxx> · Fri, 22 Feb 2013 15:33:36 -0500

Alexander Staubo <alex@xxxxxxxxxx> writes:
>    select comments.id from comments where
>      conversation_id = 3975979 order by created_at limit 13

> I'm at a loss how to explain why the planner thinks scanning a huge
> index that covers the entire table will ever beat scanning a small index
> that has 17% of the table's values.

The reason is that the LIMIT may stop the query before it's scanned all
of the index.  The planner estimates on the assumption that the desired
rows are roughly uniformly distributed within the created_at index, and
on that assumption, it looks like this query will stop fairly soon ...
but evidently, that's wrong.  On the other hand, it knows quite well
that the other plan will require pulling out 5000-some rows and then
sorting them before it can return anything, so that's not going to be
exactly instantaneous either.

In this example, I'll bet that conversation_id and created_at are pretty
strongly correlated, and that most or all of the rows with that specific
conversation_id are quite far down the created_at ordering, so that the
search through the index takes a long time to run.  OTOH, with another
conversation_id the same plan might run almost instantaneously.

If you're concerned mostly with this type of query then a 2-column index
on (conversation_id, created_at) would serve your purposes nicely.  You
could likely even dispense with the separate index on conversation_id
alone.

			regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance