Search Postgresql Archives

Re: Perfomance of IN-clause with many elements and possible solutions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



"David G. Johnston" <david.g.johnston@xxxxxxxxx> writes:
> On Mon, Jul 24, 2017 at 3:22 PM, Dmitry Lazurkin <dilaz03@xxxxxxxxx> wrote:
>> ALTER TABLE ids ALTER COLUMN id SET NOT NULL;
>> EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN
>> :values_clause;
>> 
>> Aggregate  (cost=245006.46..245006.47 rows=1 width=8) (actual
>> time=3824.095..3824.095 rows=1 loops=1)
>> Buffers: shared hit=44248
>> ->  Hash Join  (cost=7.50..235006.42 rows=4000019 width=0) (actual
>> time=1.108..3327.112 rows=3998646 loops=1)
>> ...

> ​You haven't constrained the outer relation (i.e., :values_clause) to be
> non-null which is what I believe is required for the semi-join algorithm to
> be considered.​

No, the planner is thinking about semi-join, it just decides it prefers
to de-dup and then do a plain join.  I believe this is mainly because it
lacks statistics about the inner relation and is conservative about what
it assumes about the number of duplicates in the absence of stats.
But you can force it.  Taking the original example (and being sure to
have ANALYZE'd ids):

regression=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN
:values_clause;
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=245006.16..245006.17 rows=1 width=8) (actual time=3550.581..3550.581 rows=1 loops=1)
   Buffers: shared hit=2208 read=42040
   ->  Hash Join  (cost=7.50..235006.13 rows=4000013 width=0) (actual time=0.494..3093.100 rows=4002875 loops=1)
         Hash Cond: (ids.id = "*VALUES*".column1)
         Buffers: shared hit=2208 read=42040
         ->  Seq Scan on ids  (cost=0.00..144248.33 rows=10000033 width=8) (actual time=0.071..1118.278 rows=10000000 loops=1)
               Buffers: shared hit=2208 read=42040
         ->  Hash  (cost=5.00..5.00 rows=200 width=4) (actual time=0.404..0.404 rows=200 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 16kB
               ->  HashAggregate  (cost=3.00..5.00 rows=200 width=4) (actual time=0.267..0.332 rows=200 loops=1)
                     Group Key: "*VALUES*".column1
                     ->  Values Scan on "*VALUES*"  (cost=0.00..2.50 rows=200 width=4) (actual time=0.003..0.134 rows=200 loops=1)
 Planning time: 0.561 ms
 Execution time: 3550.700 ms
(14 rows)

regression=# set enable_hashagg TO 0;
SET
regression=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN
:values_clause;
                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=245012.31..245012.32 rows=1 width=8) (actual time=3553.194..3553.194 rows=1 loops=1)
   Buffers: shared hit=2240 read=42008
   ->  Hash Join  (cost=13.64..235012.28 rows=4000013 width=0) (actual time=0.545..3093.434 rows=4002875 loops=1)
         Hash Cond: (ids.id = "*VALUES*".column1)
         Buffers: shared hit=2240 read=42008
         ->  Seq Scan on ids  (cost=0.00..144248.33 rows=10000033 width=8) (actual time=0.072..1118.853 rows=10000000 loops=1)
               Buffers: shared hit=2240 read=42008
         ->  Hash  (cost=11.14..11.14 rows=200 width=4) (actual time=0.452..0.452 rows=200 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 16kB
               ->  Unique  (cost=10.14..11.14 rows=200 width=4) (actual time=0.227..0.384 rows=200 loops=1)
                     ->  Sort  (cost=10.14..10.64 rows=200 width=4) (actual time=0.226..0.276 rows=200 loops=1)
                           Sort Key: "*VALUES*".column1
                           Sort Method: quicksort  Memory: 35kB
                           ->  Values Scan on "*VALUES*"  (cost=0.00..2.50 rows=200 width=4) (actual time=0.003..0.134 rows=200 loops=1)
 Planning time: 0.567 ms
 Execution time: 3553.297 ms
(16 rows)

regression=# set enable_sort TO 0;
SET
regression=# EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN
:values_clause;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=320003.90..320003.91 rows=1 width=8) (actual time=3548.364..3548.364 rows=1 loops=1)
   Buffers: shared hit=2272 read=41976
   ->  Hash Semi Join  (cost=5.00..310003.87 rows=4000013 width=0) (actual time=0.331..3091.235 rows=4002875 loops=1)
         Hash Cond: (ids.id = "*VALUES*".column1)
         Buffers: shared hit=2272 read=41976
         ->  Seq Scan on ids  (cost=0.00..144248.33 rows=10000033 width=8) (actual time=0.071..1117.761 rows=10000000 loops=1)
               Buffers: shared hit=2272 read=41976
         ->  Hash  (cost=2.50..2.50 rows=200 width=4) (actual time=0.236..0.236 rows=200 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 16kB
               ->  Values Scan on "*VALUES*"  (cost=0.00..2.50 rows=200 width=4) (actual time=0.003..0.142 rows=200 loops=1)
 Planning time: 0.545 ms
 Execution time: 3548.463 ms
(12 rows)

The cost to form the inner hash is basically negligible whether it's
de-duped or not, but if it's not (known) de-duped then the cost
estimate for the semijoin is going to rise some, and that discourages
selecting it.

At least in this example, the actual runtimes are basically identical
regardless, so there is no great point in sweating over it.

			regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux