On Sat, Aug 9, 2014 at 5:15 AM, Josh Berkus <josh@xxxxxxxxxxxx> wrote:
Folks,
So one thing we tell users who have chronically long IN() lists is that
they should create a temporary table and join against that instead.
Other than not having the code, is there a reason why PostgreSQL
shouldn't do something like this behind the scenes, automatically?
Hi Josh,
I know that problem for many years.
There are some workaround which doesn't require using the temporary tables (and I used that approach quite a lot when performance matter):I know that problem for many years.
SELECT * FROM sometable
WHERE
somefield IN (val1, val2, ...)
AND other_filters;
Query could be written as:
SELECT * FROM sometable
JOIN (VALUES ((val1), (val2) ...)) AS v(somefield) ON v.somefield=sometable.somefield
WHERE
SELECT * FROM sometable
JOIN (VALUES ((val1), (val2) ...)) AS v(somefield) ON v.somefield=sometable.somefield
WHERE
other_filters;
When there no index on somefield query plans would look like as:
Original query:
Filter: (somefield = ANY ('{...}'::integer[]))
Filter: (somefield = ANY ('{...}'::integer[]))
vs optimized query:
Hash Join (cost=0.25..117.89 rows=22 width=59) (actual time=5.332..5.332 rows=0 loops=1)
Hash Cond: (sometable.somefield = "*VALUES*".somefield)
...
-> Hash (cost=0.12..0.12 rows=10 width=4) (actual time=0.010..0.010 rows=10 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Values Scan on "*VALUES*" (cost=0.00..0.12 rows=10 width=4) (actual time=0.001..0.003 rows=10 loops=1)
Hash Join (cost=0.25..117.89 rows=22 width=59) (actual time=5.332..5.332 rows=0 loops=1)
Hash Cond: (sometable.somefield = "*VALUES*".somefield)
...
-> Hash (cost=0.12..0.12 rows=10 width=4) (actual time=0.010..0.010 rows=10 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 1kB
-> Values Scan on "*VALUES*" (cost=0.00..0.12 rows=10 width=4) (actual time=0.001..0.003 rows=10 loops=1)
In synthetic data I observed the following performance results (fully in-memory data with integer values):
List length IN Performance JOIN VALUES Performance
10 5.39ms 5.38ms
100 9.74ms 5.49ms
1000 53.02ms 9.89ms
10000 231.10ms 13.14ms
List length IN Performance JOIN VALUES Performance
10 5.39ms 5.38ms
100 9.74ms 5.49ms
1000 53.02ms 9.89ms
10000 231.10ms 13.14ms
So starting from 10 elements VALUES/HASH JOIN approach is clear winner.
In case of the text literals IN list performance difference even more obvious (~2 order of magnitude for 10000 list).
However, if IN list used for the primary key lookup - there are no visible performance difference between these two approaches.
So yes there are some space for optimization of "Filter: (somefield = ANY ('{...}'::integer[]))" via hashing.
--
Maxim Boguk
Senior Postgresql DBA
http://www.postgresql-consulting.ru/
Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@xxxxxxxxx
МойКруг: http://mboguk.moikrug.ru/
"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."
Senior Postgresql DBA
http://www.postgresql-consulting.ru/
Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678
LinkedIn: http://www.linkedin.com/pub/maksym-boguk/80/b99/b1b
Skype: maxim.boguk
Jabber: maxim.boguk@xxxxxxxxx
МойКруг: http://mboguk.moikrug.ru/
"People problems are solved with people.
If people cannot solve the problem, try technology.
People will then wish they'd listened at the first stage."