Re: limit clause breaks query planner?

"Pavel Stehule" <pavel.stehule@xxxxxxxxx> · Mon, 1 Sep 2008 21:17:17 +0200

Hello

2008/9/1 David West <david.west@xxxxxxxxxxxxx>:
> Thanks for your suggestion but the result is the same.
>
> Here is the explain analyse output from different queries.
> Select * from my_table where A is null and B = '21' limit 15
>
> "Limit  (cost=0.00..3.68 rows=15 width=128) (actual time=85837.043..85896.140 rows=15 loops=1)"
> "  ->  Seq Scan on my_table this_  (cost=0.00..258789.88 rows=1055580 width=128) (actual time=85837.038..85896.091 rows=15 loops=1)"
> "        Filter: ((A IS NULL) AND ((B)::text = '21'::text))"
> "Total runtime: 85896.214 ms"
>

I see it - problem is in statistics - system expect 1055580, but there
is only 15 values.

try
a) increase statistics on column a and b - probably there are strong
dependency between column a nad b, because statistic are totally out
b) try cursors
declare cursor c as Select * from my_table where A is null and B =
'21' limit 15;
fetch forward 15 from c;

http://www.postgresql.org/docs/8.2/static/sql-fetch.html

maybe

select * from (select * from mytable where b = '21' offset 0) where a
is null limit 15

regards
Pavel Stehule

> As you can see the estimated cost was 3.68: a long way from the true value.
>
> Doing 'set enable_seqscan=false' and repeating the select:
> "Limit  (cost=0.00..5.58 rows=15 width=128) (actual time=4426.438..4426.834 rows=15 loops=1)"
> "  ->  Index Scan using idx_A on my_table this_  (cost=0.00..392956.76 rows=1055970 width=128) (actual time=4426.433..4426.768 rows=15 loops=1)"
> "        Index Cond: (A IS NULL)"
> "        Filter: ((B)::text = '21'::text)"
> "Total runtime: 4426.910 ms"
>
> Probably some caching made this query faster, but it's still too slow, and using the wrong index.
>
> Deleting index A gives:
> "Limit  (cost=0.00..56.47 rows=15 width=128) (actual time=10.298..10.668 rows=15 loops=1)"
> "  ->  Index Scan using idx_B on my_table this_  (cost=0.00..3982709.15 rows=1057960 width=128) (actual time=10.293..10.618 rows=15 loops=1)"
> "        Index Cond: ((B)::text = '21'::text)"
> "        Filter: (A IS NULL)"
> "Total runtime: 10.735 ms"
> Much better.  However I need index A for another query so I can't just delete it.
>
> Looking at the estimated cost, you can see why it's choosing the order that it is choosing, but it just doesn't seem to reflect reality at all.
>
> Now here's the result of the query, with both indexes in place and sequential scan enabled
> Select * from my_table where A is null and B = '21'
> "Bitmap Heap Scan on my_table this_  (cost=20412.89..199754.37 rows=1060529 width=128) (actual time=470.772..7432.062 rows=1020062 loops=1)"
> "  Recheck Cond: ((B)::text = '21'::text)"
> "  Filter: (A IS NULL)"
> "  ->  Bitmap Index Scan on idx_B (cost=0.00..20147.76 rows=1089958 width=0) (actual time=466.545..466.545 rows=1020084 loops=1)"
> "        Index Cond: ((B)::text = '21'::text)"
> "Total runtime: 8940.119 ms"
>
> In this case it goes for the correct index.  It appears that the query planner makes very simplistic assumptions when it comes to LIMIT?
>
> Thanks
> David
>
> -----Original Message-----
> From: Pavel Stehule [mailto:pavel.stehule@xxxxxxxxx]
> Sent: 01 September 2008 13:53
> To: David West
> Cc: pgsql-performance@xxxxxxxxxxxxxx
> Subject: Re: [PERFORM] limit clause breaks query planner?
>
> Hello
>
> you should partial index
>
> create index foo(b) on mytable where a is null;
>
> regards
> Pavel Stehule
>
> 2008/9/1 David West <david.west@xxxxxxxxxxxxx>:
>> Hi,
>>
>>
>>
>> I have a single table with about 10 million rows, and two indexes.  Index A
>> is on a column A with 95% null values.  Index B is on a column B with about
>> 10 values, ie. About a million rows of each value.
>>
>>
>>
>> When I do a simple query on the table (no joins) with the following
>> condition:
>>
>> A is null AND
>>
>> B = '21'
>>
>>
>>
>> it uses the correct index, index B.  However, when I add a limit clause of
>> 15, postgres decides to do a sequential scan :s.  Looking at the results
>> from explain:
>>
>>
>>
>> "Limit  (cost=0.00..3.69 rows=15 width=128)"
>>
>> "  ->  Seq Scan on my_table this_  (cost=0.00..252424.24 rows=1025157
>> width=128)"
>>
>> "        Filter: ((A IS NULL) AND ((B)::text = '21'::text))"
>>
>>
>>
>> It appears that postgres is (very incorrectly) assuming that it will only
>> have to retrieve 15 rows on a sequential scan, and gives a total cost of
>> 3.69.  In reality, it has to scan many rows forward until it finds the
>> correct value, yielding very poor performance for my table.
>>
>>
>>
>> If I disable sequential scan (set enable_seqscan=false) it then incorrectly
>> uses the index A that has 95% null values: it seems to incorrectly apply the
>> same logic again that it will only have to retrieve 15 rows with the limit
>> clause, and thinks that the index scan using A is faster than index scan B.
>>
>>
>>
>> Only by deleting the index on A and disabling sequential scan will it use
>> the correct index, which is of course by far the fastest.
>>
>>
>>
>> Is there an assumption in the planner that a limit of 15 will mean that
>> postgres will only have to read 15 rows?  If so is this a bad assumption?
>> If a particular query is faster without a limit, then surely it will also be
>> faster with the limit.
>>
>>
>>
>> Any workarounds for this?
>>
>>
>>
>> Thanks
>>
>> David
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>