Re: WHERE IN (subselect) versus WHERE IN (1,2,3,)

"Albe Laurenz" <laurenz.albe@xxxxxxxxxx> · Mon, 19 Mar 2012 09:24:37 +0100

Kevin Goess wrote:
> We have a table "contexts" with 1.6 million rows, and a table
"articles" with 1.4 million rows, where
> an "article" is a particular kind of "context".  We want to select
from a join on those two tables
> like this
> 
>     SELECT COUNT(*)
>     FROM contexts
>     JOIN articles ON (articles.context_key=contexts.context_key)
>     WHERE contexts.context_key IN (...);
>     /* and some combination of columns from articles and contexts */
> 
> If "IN(...)" is a query, then this guy does a seq scan on the contexts
table, even if the subquery is
> "select col_a from kgtest" where kgtest has one row.  If however I
read the ids beforehand and write
> them into the query, a la "IN (111,222,333...)", then the everything
is happy, up to at least 20,000
> values written into the sql, at which point smaller machines will take
2-5 minutes to parse the query.
> 
> I can certainly write the ids inline into the SQL, but when I do that
I get the distinct impression
> that I'm Doing It Wrong.  Is this expected behavior?  It seems
surprising to me.
> 
> 
> To demonstrate:
> 
>     /* nothing up my sleeve */
>     # select * from kgtest;
>       cola
>     ---------
>      1652729
>     (1 row)

[...]

>   /* subselect, query plan does seq scan on contexts */
[...]
>     ->  Seq Scan on kgtest  (cost=0.00..34.00 rows=2400 width=4)
(actual time=0.048..0.050 rows
[...]

There is something missing in this line, but according to what you wrote
it must be "actual [...] rows=1", And yet the planner assumes that the
scan will return 2400 rows.
That means that your statistics are not accurate.

As a first measure, you should ANALYZE the tables involved and see if
the problem persists.  If yes, post the new plans.

Yours,
Laurenz Albe

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general