My apologies, I'm sure this question has been asked before but I couldn't find anything on the list that meant anything to me.
We have a table "contexts" with 1.6 million rows, and a table "articles" with 1.4 million rows, where an "article" is a particular kind of "context". We want to select from a join on those two tables like this
SELECT COUNT(*)
FROM contexts
JOIN articles ON (articles.context_key=contexts.context_key)
WHERE contexts.context_key IN (...);
/* and some combination of columns from articles and contexts */
If "IN(...)" is a query, then this guy does a seq scan on the contexts table, even if the subquery is "select col_a from kgtest" where kgtest has one row. If however I read the ids beforehand and write them into the query, a la "IN (111,222,333...)", then the everything is happy, up to at least 20,000 values written into the sql, at which point smaller machines will take 2-5 minutes to parse the query.
I can certainly write the ids inline into the SQL, but when I do that I get the distinct impression that I'm Doing It Wrong. Is this expected behavior? It seems surprising to me.
To demonstrate:
/* nothing up my sleeve */
# select * from kgtest;
cola
---------
1652729
(1 row)
/* inline, good query plan */
# explain (analyze, buffers) select count(*) from contexts JOIN articles ON (articles.context_key=contexts.context_key) where contexts.context_key in (1652729);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=3.82..3.83 rows=1 width=0) (actual time=0.188..0.189 rows=1 loops=1)
Buffers: shared hit=7
-> Nested Loop (cost=0.00..3.81 rows=1 width=0) (actual time=0.181..0.181 rows=0 loops=1)
Buffers: shared hit=7
-> Index Scan using contexts_pkey on contexts (cost=0.00..1.90 rows=1 width=4) (actual time=0.109..0.112 ro
Index Cond: (context_key = 1652729)
Buffers: shared hit=4
-> Index Scan using articles_pkey on articles (cost=0.00..1.90 rows=1 width=4) (actual time=0.060..0.060 ro
Index Cond: (articles.context_key = 1652729)
Buffers: shared hit=3
Total runtime: 0.324 ms
(11 rows)
/* subselect, query plan does seq scan on contexts */
# explain (analyze, buffers) select count(*)from contexts JOIN articles ON (articles.context_key=contexts.context_key) where contexts.context_key in (select cola from kgtest);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Aggregate (cost=118505.72..118505.73 rows=1 width=0) (actual time=0.274..0.275 rows=1 loops=1)
Buffers: shared hit=5
-> Hash Join (cost=12512.61..116661.91 rows=737524 width=0) (actual time=0.269..0.269 rows=0 loops=1)
Hash Cond: (contexts.context_key = articles.context_key)
Buffers: shared hit=5
-> Seq Scan on contexts (cost=0.00..64533.03 rows=1648203 width=4) (actual time=0.009..0.009 rows=1 loops=1
Buffers: shared hit=1
-> Hash (cost=412.56..412.56 rows=737524 width=8) (actual time=0.110..0.110 rows=0 loops=1)
Buckets: 4096 Batches: 32 Memory Usage: 0kB
Buffers: shared hit=4
-> Nested Loop (cost=40.00..412.56 rows=737524 width=8) (actual time=0.107..0.107 rows=0 loops=1)
Buffers: shared hit=4
-> HashAggregate (cost=40.00..42.00 rows=200 width=4) (actual time=0.069..0.071 rows=1 loops=1)
Buffers: shared hit=1
-> Seq Scan on kgtest (cost=0.00..34.00 rows=2400 width=4) (actual time=0.048..0.050 rows
Buffers: shared hit=1
-> Index Scan using articles_pkey on articles (cost=0.00..1.84 rows=1 width=4) (actual time=0.0
Index Cond: (articles.context_key = kgtest.cola)
Buffers: shared hit=3
Total runtime: 0.442 ms
--
Kevin M. Goess
Software Engineer
Berkeley Electronic Press
kgoess@xxxxxxxxxxx
510-665-1200 x179
www.bepress.com
bepress: sustainable scholarly publishing