Hello.
I have database with events with type from different souces identified by id. I have query which filters events by IN-clause with many ids (1-500 ids). I see poor perfomance of IN-clause and try to investigate this problem.
SELECT version();
version
-------------------------------------------------------------------------------------------------------------------
PostgreSQL 10beta2 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit
-- Full table can fit in memory
show shared_buffers;
shared_buffers
----------------
2GB
show work_mem;
work_mem
----------
16MB
SET max_parallel_workers_per_gather TO 0;
SET max_parallel_workers TO 0;
-- Create table with 10 000 000 rows with 500 bigints
CREATE TABLE ids AS SELECT trunc(random() * 500)::bigint as id from generate_series(1, 10000000);
-- IN (...)
SELECT ('(' || string_agg(id::text, ',') || ')') AS in_clause
FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN :in_clause;
Aggregate (cost=2654265.02..2654265.03 rows=1 width=8) (actual time=17268.831..17268.831 rows=1 loops=1)
Buffers: shared hit=44248
-> Seq Scan on ids (cost=0.00..2644260.48 rows=4001815 width=0) (actual time=0.066..16722.072 rows=3998646 loops=1)
Filter: (id = ANY ('{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199}'::bigint[]))
Rows Removed by Filter: 6001354
Buffers: shared hit=44248
Planning time: 3.324 ms
Execution time: 17268.907 ms
-- IN (VALUES ...)
SELECT ('(VALUES ' || string_agg('(' || id::text || ')', ',') || ')') AS values_clause
FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE id IN :values_clause;
Aggregate (cost=245006.46..245006.47 rows=1 width=8) (actual time=4086.188..4086.188 rows=1 loops=1)
Buffers: shared hit=44248
-> Hash Join (cost=7.50..235006.42 rows=4000019 width=0) (actual time=0.978..3557.037 rows=3998646 loops=1)
Hash Cond: (ids.id = "*VALUES*".column1)
Buffers: shared hit=44248
-> Seq Scan on ids (cost=0.00..144248.48 rows=10000048 width=8) (actual time=0.031..1138.542 rows=10000000 loops=1)
Buffers: shared hit=44248
-> Hash (cost=5.00..5.00 rows=200 width=4) (actual time=0.923..0.923 rows=200 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 16kB
-> HashAggregate (cost=3.00..5.00 rows=200 width=4) (actual time=0.606..0.759 rows=200 loops=1)
Group Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (cost=0.00..2.50 rows=200 width=4) (actual time=0.003..0.330 rows=200 loops=1)
Planning time: 1.094 ms
Execution time: 4086.333 ms
-- '...'::hstore ? id
SELECT ('''' || string_agg(id::text || '=>NULL', ',') || '''::hstore') AS hstore_clause
FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE :hstore_clause ? id::text;
Planning time: 0.206 ms
Execution time: 5032.794 ms
-- '...'::jsonb ? id
SELECT ('''{' || string_agg('"' || id::text || '": null', ',') || '}''::jsonb') AS jsonb_clause
FROM (SELECT id FROM ids GROUP BY id ORDER BY id LIMIT 200) AS s \gset
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM ids WHERE :jsonb_clause ? id::text;
Planning time: 0.114 ms
Execution time: 9277.307 ms
IN-VALUES clause has the bestest perfomance. So I have some questions:
- May be exist better solution?
- Does PostgreSQL have support of hashset structure? Extension (I don't found)?
- IN-VALUES clause adds new node to plan. Has additional node big overhead? How about filter by two or more IN-VALUES clause?
Thanks.