Re: Index is not used for "IN (non-correlated subquery)"

George <pinkisntwell@xxxxxxxxx> · Wed, 30 Nov 2016 19:05:18 +0200

On Wed, Nov 30, 2016 at 6:45 PM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
> On Wed, Nov 30, 2016 at 10:42 AM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> George <pinkisntwell@xxxxxxxxx> writes:
>>> explain select * from wg3ppbm_transaction where partner_uuid in (
>>>             select p.uuid
>>>             from wg3ppbm_userpartner up
>>>             join wg3ppbm_partner p on p.id = up.partner_id
>>> );
>>
>>> "Hash Semi Join  (cost=2.07..425.72 rows=2960 width=482)"
>>> "  Hash Cond: ((wg3ppbm_transaction.partner_uuid)::text = (p.uuid)::text)"
>>> "  ->  Seq Scan on wg3ppbm_transaction  (cost=0.00..375.19 rows=5919 width=482)"
>>> "  ->  Hash  (cost=2.06..2.06 rows=1 width=37)"
>>> "        ->  Nested Loop  (cost=0.00..2.06 rows=1 width=37)"
>>> "              Join Filter: (up.partner_id = p.id)"
>>> "              ->  Seq Scan on wg3ppbm_userpartner up
>>> (cost=0.00..1.01 rows=1 width=4)"
>>> "              ->  Seq Scan on wg3ppbm_partner p  (cost=0.00..1.02
>>> rows=2 width=41)"
>>
>> This plan is expecting to have to return about half of the rows in
>> wg3ppbm_transaction, a situation for which an indexscan would NOT
>> be a better choice.  The usual rule of thumb is that you need to be
>> retrieving at most one or two percent of a table's rows for an indexscan
>> on it to be faster than a seqscan.
>>
>> I think however that the "half" may be a default estimate occasioned
>> by the other tables being empty and therefore not having any statistics.
>> Another rule of thumb is that the plans you get for tiny tables have
>> little to do with what happens once there's lots of data.
>
> Yeah, don't make query plan assumptions against empty or nearly empty
> tables.  As the data grows, the plans will suitably change.  Perhaps
> OP just recently loaded a bunch of data and the tables haven't been
> analyzed yet?

I just added a significant number of rows to the table. I now have
1.3M rows in total but only 8K rows that contain the value I am
seeking. I also ran ANALYZE after loading the data. The query plans
for the two queries did not change. Also, the simple query returns in
45 ms while the one with the subquery needs 1.5 s, i.e. it is about
30x slower.

So there is definitely something wrong here. This situation makes many
row-level security use cases cumbersome since you need to have
almost the same WHERE clause both in the row-level security policy and
in every SELECT query in order for the index to be used.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general