Re: Subquery in a JOIN not getting restricted?

"Kevin Grittner" <Kevin.Grittner@xxxxxxxxxxxx> · Mon, 07 Nov 2011 15:53:33 -0600

Jay Levitt <jay.levitt@xxxxxxxxx> wrote:
> When I run the following query:
> 
> select questions.id
> from questions
> join (
>      select u.id as user_id
>      from users as u
>      left join scores as s
>      on s.user_id = u.id
> ) as subquery
> on subquery.user_id = questions.user_id;
> 
> the subquery is scanning my entire user table, even though it's
> restricted by the outer query.  (My real subquery is much more
> complicated, of course, but this is the minimal fail case.)

It's not a fail case -- it's choosing the plan it thinks is cheapest
based on your costing parameters and the statistics gathered by the
latest ANALYZE of the data.

> Is this just not a thing the optimizer can do?  Are there ways to
> rewrite this, still as a subquery, that will be smart enough to
> only produce the one row of subquery that matches
> questions.user_id?

Well, it can certainly produce the plan you seem to want, if it
looks less expensive.  It kinda did with the following script:

create table questions
  (id int not null primary key, user_id int not null);

insert into questions
  select generate_series(1,100), (random()*1000000)::int;

create table users (id int not null primary key);

insert into users select generate_series(1, 1000000);

create table scores
  (id int not null primary key, user_id int not null);

insert into scores select n, n
  from (select generate_series(1,1000000)) x(n);

vacuum freeze analyze;

explain analyze
select questions.id
from questions
join (
     select u.id as user_id
     from users as u
     left join scores as s
     on s.user_id = u.id
) as subquery
on subquery.user_id = questions.user_id;

Here's the plan I got, which scans the questions and then uses the
index to join to the users.  It's throwing the result of that into a
hash table which is then checked from a sequential scan of the
scores table.  If I had made the scores table wider, it might have
gone from the user table to scores on the index.

 Hash Right Join
     (cost=438.23..18614.23 rows=100 width=4)
     (actual time=2.776..161.237 rows=100 loops=1)
   Hash Cond: (s.user_id = u.id)
   ->  Seq Scan on scores s
           (cost=0.00..14425.00 rows=1000000 width=4)
           (actual time=0.025..77.876 rows=1000000 loops=1)
   ->  Hash
           (cost=436.98..436.98 rows=100 width=8)
           (actual time=0.752..0.752 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 4kB
         ->  Nested Loop
                 (cost=0.00..436.98 rows=100 width=8)
                 (actual time=0.032..0.675 rows=100 loops=1)
               ->  Seq Scan on questions
                       (cost=0.00..2.00 rows=100 width=8)
                       (actual time=0.010..0.042 rows=100 loops=1)
               ->  Index Only Scan using users_pkey on users u
                       (cost=0.00..4.34 rows=1 width=4)
                       (actual time=0.005..0.005 rows=1 loops=100)
                     Index Cond: (id = questions.user_id)
 Total runtime: 168.585 ms

If you want help figuring out whether it is choosing the fastest
plan, and how to get it do better if it is not, please read this
page and post the relevant information:

http://wiki.postgresql.org/wiki/SlowQueryQuestions

-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance