Hi all,
I've figured out how to solve the performance issues I've been encountering with a particular query, but I'm interested in better understanding the intuition behind why the better query is so much more performant.
The query in question involves a NOT IN filter from a CTE:
WITH temp as (
SELECT temp_id
FROM other_table
WHERE ...
)
SELECT COUNT(*)
FROM main_table m
WHERE m.main_id NOT IN (SELECT temp_id from temp);
The query plan for (the non-masked version of) this query includes:
Filter: (NOT (SubPlan 1))
Rows Removed by Filter: 1
SubPlan 1
-> CTE Scan on temp (cost=0.00..1950.60 rows=97530 width=418) (actual time=0.018..4.170 rows=15697 loops=100731)
My understanding is that PostgreSQL is not able to make this into a hashed Subplan because the expected number of rows * width of rows is too large for the work_mem setting, and so instead it has to do many repeated linear passes to find whether the various values of main_id are in the list of temp_ids.
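If that reading is right, I'd expect that temporarily raising work_mem for the session would let the planner go back to a hashed SubPlan. I haven't confirmed this, and the value below is just an illustrative guess (the query is the masked version from above):

SET work_mem = '256MB';  -- session-local; value is only a guess for illustration
EXPLAIN (ANALYZE)
WITH temp as (
SELECT temp_id
FROM other_table
WHERE ...
)
SELECT COUNT(*)
FROM main_table m
WHERE m.main_id NOT IN (SELECT temp_id FROM temp);
RESET work_mem;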
The resolution to this problem came via https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/, and involves rewriting the query like so:
WITH temp as (
SELECT temp_id
FROM other_table
WHERE ...
)
SELECT COUNT(*)
FROM main_table m
WHERE NOT EXISTS (SELECT temp_id from temp where temp_id = m.main_id);
The query plan for this query does not involve a high number of loops like the one above. (I believe this plan is equivalent to what I would get if I instead explicitly LEFT JOINed main_table to temp and used a WHERE clause to filter out NULL values of temp_id; a sketch of that formulation follows the plan below.)
-> Merge Anti Join (cost=147305.45..149523.68 rows=192712 width=4) (actual time=5050.773..5622.266 rows=158850 loops=1)
Merge Cond: (m.main_id = temp.temp_id)
-> Sort (cost=115086.98..115637.10 rows=220050 width=19) (actual time=1290.829..1655.066 rows=199226 loops=1)
Sort Key: m.main_id
Sort Method: external merge Disk: 5632kB
-> Materialize (cost=32218.47..32764.37 rows=109180 width=418) (actual time=3759.936..3787.724 rows=38268 loops=1)
-> Sort (cost=32218.47..32491.42 rows=109180 width=418) (actual time=3759.933..3771.117 rows=38268 loops=1)
Sort Key: temp.temp_id
Sort Method: quicksort Memory: 3160kB
-> CTE Scan on temp (cost=0.00..2183.60 rows=109180 width=418) (actual time=2316.745..3735.486 rows=38268 loops=1)
Instead, it sorts (partly on disk) and uses a Merge Anti Join, which as I understand it eliminates the need for repeated passes through temp.
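For reference, the LEFT JOIN formulation I had in mind above is roughly:

WITH temp as (
SELECT temp_id
FROM other_table
WHERE ...
)
SELECT COUNT(*)
FROM main_table m
LEFT JOIN temp t ON t.temp_id = m.main_id
WHERE t.temp_id IS NULL;  -- keep only main_ids with no match in temp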
My primary question is: why is this approach only possible (for data too large for memory) when using NOT EXISTS, and not when using NOT IN?
I understand that there is a slight difference in meaning between the two expressions: NOT IN produces NULL if the right-hand side contains any NULL values (in this case there are none, so the two queries should return the same COUNT). But if anything, I would expect that to improve the performance of NOT IN, since a single pass through that data would reveal whether there are any NULLs, at which point that information could be used to short-circuit. So I am a bit baffled.
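For what it's worth, the semantic difference I mean can be seen with a toy example (values made up):

-- NOT IN returns no row, because 2 NOT IN (1, NULL) evaluates to NULL:
SELECT 1 AS not_in_result
WHERE 2 NOT IN (SELECT unnest(ARRAY[1, NULL]::int[]));

-- NOT EXISTS returns a row, because no element of the set equals 2:
SELECT 1 AS not_exists_result
WHERE NOT EXISTS (
    SELECT 1 FROM unnest(ARRAY[1, NULL]::int[]) AS x WHERE x = 2
);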
Thanks very much for your help!
Best,
Lincoln
Lincoln Swaine-Moore