Re: Select in subselect vs select = any array

Pavel Stehule <pavel.stehule@xxxxxxxxx> · Mon, 21 Mar 2011 07:38:31 +0100

2011/3/21 Adam Tistler <atistler@xxxxxxxxx>:
> Pavel, thanks for the help.
>
> I increased work_mem from 16MB to 64MB, no difference. ÂThe queries are really just a test case. ÂMy actual queries are actual just large number of primary keys that I am selecting from the db:
>
> For example:
> Â select * from nodes where node_id in ( 1, 2, 3 ..... )
>
> I found that even for small queries, the following is faster:
> Â select * from nodes where node_in = any (array[1,2,3 .... ])

it depends on version. I think so on last postgres, these queries are
same, not sure.

Regards

Pavel

>
>
> Its not really a big deal to me, I was just wondering if others could reproduce it on other systems/versions and if perhaps this is an issue that I should point out to postgres-dev.
>
>
> Results below:
>
> logicops2=# explain analyze select count(*) from nodes where node_id in ( select node_id from nodes limit 100000 );
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------
> ÂAggregate Â(cost=3017.18..3017.19 rows=1 width=0) (actual time=1017.051..1017.051 rows=1 loops=1)
> Â -> ÂNested Loop Â(cost=2887.05..3016.68 rows=200 width=0) (actual time=157.290..986.329 rows=100000 loops=1)
> Â Â Â Â -> ÂHashAggregate Â(cost=2887.05..2889.05 rows=200 width=4) (actual time=157.252..241.995 rows=100000 loops=1)
> Â Â Â Â Â Â Â -> ÂLimit Â(cost=0.00..1637.05 rows=100000 width=4) (actual time=0.009..73.942 rows=100000 loops=1)
> Â Â Â Â Â Â Â Â Â Â -> ÂSeq Scan on nodes Â(cost=0.00..12355.34 rows=754734 width=4) (actual time=0.008..35.428 rows=100000 loops=1)
> Â Â Â Â -> ÂIndex Scan using n_node_id_index on nodes Â(cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=100000)
> Â Â Â Â Â Â Â Index Cond: (public.nodes.node_id = public.nodes.node_id)
> ÂTotal runtime: 1017.794 ms
> (8 rows)
>
> logicops2=# explain analyze select count(*) from nodes where node_id = any(array ( select node_id from nodes limit 100000 ));
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------
> ÂAggregate Â(cost=1718.60..1718.61 rows=1 width=0) (actual time=485.554..485.555 rows=1 loops=1)
> Â InitPlan 1 (returns $0)
> Â Â -> ÂLimit Â(cost=0.00..1637.05 rows=100000 width=4) (actual time=0.011..73.037 rows=100000 loops=1)
> Â Â Â Â Â -> ÂSeq Scan on nodes Â(cost=0.00..12355.34 rows=754734 width=4) (actual time=0.010..34.462 rows=100000 loops=1)
> Â -> ÂBitmap Heap Scan on nodes Â(cost=42.67..81.53 rows=10 width=0) (actual time=433.003..461.108 rows=100000 loops=1)
> Â Â Â Â Recheck Cond: (node_id = ANY ($0))
> Â Â Â Â -> ÂBitmap Index Scan on n_node_id_index Â(cost=0.00..42.67 rows=10 width=0) (actual time=432.810..432.810 rows=100000 loops=1)
> Â Â Â Â Â Â Â Index Cond: (node_id = ANY ($0))
> ÂTotal runtime: 485.638 ms
> (9 rows)
>
> On Mar 21, 2011, at 1:54 AM, Pavel Stehule wrote:
>
>> Hello
>>
>> I think so HashAggregate goes out of memory - you can try to increase
>> a work_mem.
>>
>> There are better queries for counting duplicit then cross join
>>
>> Regards
>>
>> Pavel Stehule
>>
>> 2011/3/21 Adam Tistler <atistler@xxxxxxxxx>:
>>> logicops2=# explain analyze select count(*) from nodes where node_id = any( ÂArray(select node_id from nodes limit 100000) );
>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â QUERY PLAN
>>> -----------------------------------------------------------------------------------------------------------------------------------------
>>> ÂAggregate Â(cost=1718.59..1718.60 rows=1 width=0) (actual time=509.126..509.127 rows=1 loops=1)
>>> Â InitPlan 1 (returns $0)
>>> Â Â -> ÂLimit Â(cost=0.00..1637.04 rows=100000 width=4) (actual time=0.010..76.604 rows=100000 loops=1)
>>> Â Â Â Â Â -> ÂSeq Scan on nodes Â(cost=0.00..12355.41 rows=754741 width=4) (actual time=0.008..38.105 rows=100000 loops=1)
>>> Â -> ÂBitmap Heap Scan on nodes Â(cost=42.67..81.53 rows=10 width=0) (actual time=447.274..484.283 rows=100000 loops=1)
>>> Â Â Â Â Recheck Cond: (node_id = ANY ($0))
>>> Â Â Â Â -> ÂBitmap Index Scan on n_node_id_index Â(cost=0.00..42.67 rows=10 width=0) (actual time=447.074..447.074 rows=100000 loops=1)
>>> Â Â Â Â Â Â Â Index Cond: (node_id = ANY ($0))
>>> ÂTotal runtime: 509.209 ms
>>> (9 rows)
>>>
>>> Time: 510.009 ms
>>>
>>>
>>> logicops2=# explain analyze select count(*) from nodes where node_id in (select node_id from nodes limit 100000);
>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â QUERY PLAN
>>> ----------------------------------------------------------------------------------------------------------------------------------------
>>> ÂAggregate Â(cost=3017.17..3017.18 rows=1 width=0) (actual time=1052.866..1052.866 rows=1 loops=1)
>>> Â -> ÂNested Loop Â(cost=2887.04..3016.67 rows=200 width=0) (actual time=167.310..1021.540 rows=100000 loops=1)
>>> Â Â Â Â -> ÂHashAggregate Â(cost=2887.04..2889.04 rows=200 width=4) (actual time=167.198..251.205 rows=100000 loops=1)
>>> Â Â Â Â Â Â Â -> ÂLimit Â(cost=0.00..1637.04 rows=100000 width=4) (actual time=0.008..80.090 rows=100000 loops=1)
>>> Â Â Â Â Â Â Â Â Â Â -> ÂSeq Scan on nodes Â(cost=0.00..12355.41 rows=754741 width=4) (actual time=0.007..41.566 rows=100000 loops=1)
>>> Â Â Â Â -> ÂIndex Scan using n_node_id_index on nodes Â(cost=0.00..0.63 rows=1 width=4) (actual time=0.006..0.007 rows=1 loops=100000)
>>> Â Â Â Â Â Â Â Index Cond: (public.nodes.node_id = public.nodes.node_id)
>>> ÂTotal runtime: 1053.523 ms
>>> (8 rows)
>>>
>>> Time: 1054.864 ms
>>>
>>>
>>>
>>> On Mar 20, 2011, at 2:51 AM, Pavel Stehule wrote:
>>>
>>>> Hello
>>>>
>>>> 2011/3/20 Adam Tistler <atistler@xxxxxxxxx>:
>>>>> I have noticed that SELECT ... = ANY(ARRAY(...)) Âis about twice as fast as SELECT IN ( ... ).
>>>>> Can anyone explain a reason for this? ÂResults are the bottom and are reproducible. ÂI can test with other versions if that is necessary.
>>>>>
>>>>
>>>> send a result of EXPLAIN ANALYZE SELECT ..., please
>>>>
>>>> The reasons can be different - less seq scans, indexes
>>>>
>>>> Regards
>>>>
>>>> Pavel Stehule
>>>>
>>>>
>>>>
>>>>> ./configure --prefix=/usr/local/pgsql84 --with-openssl --with-perl
>>>>> CentOS release 5.4 (Final)
>>>>> psql (PostgreSQL) 8.4.1
>>>>>
>>>>> prompt2=# select count(*) from nodes;
>>>>> Âcount
>>>>> --------
>>>>> Â754734
>>>>> (1 row)
>>>>>
>>>>>
>>>>> prompt2=# \d nodes
>>>>> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ÂTable "public.nodes"
>>>>> Â ÂColumn Â Â| Â Â Â Â Â Type Â Â Â Â Â | Â Â Â Â Â Â Â Â Â Â Â Â Modifiers
>>>>> --------------+--------------------------+-----------------------------------------------------------
>>>>> Ânode_id Â Â Â| integer Â Â Â Â Â Â Â Â Â| not null default nextval(('node_id_seq'::text)::regclass)
>>>>> Ânode_type_id | integer Â Â Â Â Â Â Â Â Â| not null
>>>>> Âtemplate_id Â| integer Â Â Â Â Â Â Â Â Â| not null
>>>>> Âtimestamp Â Â| timestamp with time zone | default ('now'::text)::timestamp(6) with time zone
>>>>> Indexes:
>>>>> Â Â"nodes_pkey" PRIMARY KEY, btree (node_id)
>>>>> Â Â"n_node_id_index" btree (node_id)
>>>>> Â Â"n_node_type_id_index" btree (node_type_id)
>>>>> Â Â"n_template_id_index" btree (template_id)
>>>>>
>>>>> prompt2=# select count(*) from nodes where node_id = any( ÂArray(select node_id from nodes limit 100000) );
>>>>> Âcount
>>>>> --------
>>>>> Â100000
>>>>> (1 row)
>>>>>
>>>>> Time: 404.530 ms
>>>>> prompt2=# select count(*) from nodes where node_id = any( ÂArray(select node_id from nodes limit 100000) );
>>>>> Âcount
>>>>> --------
>>>>> Â100000
>>>>> (1 row)
>>>>>
>>>>> Time: 407.316 ms
>>>>> prompt2=# select count(*) from nodes where node_id = any( ÂArray(select node_id from nodes limit 100000) );
>>>>> Âcount
>>>>> --------
>>>>> Â100000
>>>>> (1 row)
>>>>>
>>>>> Time: 408.728 ms
>>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>> Âcount
>>>>> --------
>>>>> Â100000
>>>>> (1 row)
>>>>>
>>>>> Time: 793.840 ms
>>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>> Âcount
>>>>> --------
>>>>> Â100000
>>>>> (1 row)
>>>>>
>>>>> Time: 779.137 ms
>>>>> prompt2=# select count(*) from nodes where node_id in (select node_id from nodes limit 100000 );
>>>>> Âcount
>>>>> --------
>>>>> Â100000
>>>>> (1 row)
>>>>>
>>>>> Time: 781.820 ms
>>>>>
>>>>>
>>>>> --
>>>>> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
>>>>> To make changes to your subscription:
>>>>> http://www.postgresql.org/mailpref/pgsql-performance
>>>>>
>>>
>>>
>
>

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance