WHERE IN (subselect) versus WHERE IN (1,2,3,)

Kevin Goess <kgoess@xxxxxxxxxxx> · Thu, 15 Mar 2012 10:47:03 +0100

My apologies, I'm sure this question has been asked before but I couldn't find anything on the list that meant anything to me.

We have a table "contexts" with 1.6 million rows, and a table "articles" with 1.4 million rows, where an "article" is a particular kind of "context".  We want to select from a join on those two tables like this

    SELECT COUNT(*) 
    FROM contexts 
    JOIN articles ON (articles.context_key=contexts.context_key) 
    WHERE contexts.context_key IN (...);
    /* and some combination of columns from articles and contexts */

If "IN(...)" is a query, then this guy does a seq scan on the contexts table, even if the subquery is "select col_a from kgtest" where kgtest has one row.  If however I read the ids beforehand and write them into the query, a la "IN (111,222,333...)", then the everything is happy, up to at least 20,000 values written into the sql, at which point smaller machines will take 2-5 minutes to parse the query.

I can certainly write the ids inline into the SQL, but when I do that I get the distinct impression that I'm Doing It Wrong.  Is this expected behavior?  It seems surprising to me. 

To demonstrate:

    /* nothing up my sleeve */
    # select * from kgtest;
      cola   
    ---------
     1652729
    (1 row)

    /* inline, good query plan */
# explain (analyze, buffers) select count(*) from contexts JOIN articles ON (articles.context_key=contexts.context_key) where contexts.context_key in (1652729);

                                                             QUERY PLAN                                               
----------------------------------------------------------------------------------------------------------------------

 Aggregate  (cost=3.82..3.83 rows=1 width=0) (actual time=0.188..0.189 rows=1 loops=1)
   Buffers: shared hit=7
   ->  Nested Loop  (cost=0.00..3.81 rows=1 width=0) (actual time=0.181..0.181 rows=0 loops=1)
         Buffers: shared hit=7

         ->  Index Scan using contexts_pkey on contexts  (cost=0.00..1.90 rows=1 width=4) (actual time=0.109..0.112 ro
               Index Cond: (context_key = 1652729)
               Buffers: shared hit=4
         ->  Index Scan using articles_pkey on articles  (cost=0.00..1.90 rows=1 width=4) (actual time=0.060..0.060 ro

               Index Cond: (articles.context_key = 1652729)
               Buffers: shared hit=3
 Total runtime: 0.324 ms
(11 rows)

  /* subselect, query plan does seq scan on contexts */

# explain (analyze, buffers) select count(*)from contexts JOIN articles ON (articles.context_key=contexts.context_key) where contexts.context_key in (select cola from kgtest);

                                                                   QUERY PLAN                                         
----------------------------------------------------------------------------------------------------------------------

 Aggregate  (cost=118505.72..118505.73 rows=1 width=0) (actual time=0.274..0.275 rows=1 loops=1)
   Buffers: shared hit=5
   ->  Hash Join  (cost=12512.61..116661.91 rows=737524 width=0) (actual time=0.269..0.269 rows=0 loops=1)

         Hash Cond: (contexts.context_key = articles.context_key)
         Buffers: shared hit=5
         ->  Seq Scan on contexts  (cost=0.00..64533.03 rows=1648203 width=4) (actual time=0.009..0.009 rows=1 loops=1

               Buffers: shared hit=1
         ->  Hash  (cost=412.56..412.56 rows=737524 width=8) (actual time=0.110..0.110 rows=0 loops=1)
               Buckets: 4096  Batches: 32  Memory Usage: 0kB
               Buffers: shared hit=4

               ->  Nested Loop  (cost=40.00..412.56 rows=737524 width=8) (actual time=0.107..0.107 rows=0 loops=1)
                     Buffers: shared hit=4
                     ->  HashAggregate  (cost=40.00..42.00 rows=200 width=4) (actual time=0.069..0.071 rows=1 loops=1)

                           Buffers: shared hit=1
                           ->  Seq Scan on kgtest  (cost=0.00..34.00 rows=2400 width=4) (actual time=0.048..0.050 rows
                                 Buffers: shared hit=1

                     ->  Index Scan using articles_pkey on articles  (cost=0.00..1.84 rows=1 width=4) (actual time=0.0
                           Index Cond: (articles.context_key = kgtest.cola)
                           Buffers: shared hit=3

 Total runtime: 0.442 ms

-- 
Kevin M. Goess
Software Engineer
Berkeley Electronic Press
kgoess@xxxxxxxxxxx

510-665-1200 x179
www.bepress.com

bepress: sustainable scholarly publishing