Re: Subquery in a JOIN not getting restricted?

Jay Levitt <jay.levitt@xxxxxxxxx> · Wed, 09 Nov 2011 15:39:14 -0500

Kevin Grittner wrote:
Jay Levitt<jay.levitt@xxxxxxxxx>  wrote:

I don't get why the GROUP BY in this subquery forces it to scan
the entire users table (seq scan here, index scan on a larger
table) when there's only one row in users that can match:

Are you sure there's a plan significantly faster than 1.3 ms?

Yep!  Watch this:

drop schema if exists jaytest cascade;
create schema jaytest;
set search_path to jaytest;

create table questions (
  id int not null primary key,
  user_id int not null
);
insert into questions
  select generate_series(1,1100), (random()*2000000)::int;

create table users (
  id int not null primary key
);
insert into users select generate_series(1, 2000000);

vacuum freeze analyze;

explain analyze
select questions.id
from questions
join (
  select u.id
  from users as u
  group by u.id
) as s
on s.id = questions.user_id
where questions.id = 1;

-----------------------
 Merge Join  (cost=8.28..90833.02 rows=1818 width=4) (actual 
time=888.787..888.790 rows=1 loops=1)
   Merge Cond: (u.id = questions.user_id)
   ->  Group  (cost=0.00..65797.47 rows=2000000 width=4) (actual 
time=0.017..735.509 rows=1747305 loops=1)
         ->  Index Scan using users_pkey on users u  (cost=0.00..60797.47 
rows=2000000 width=4) (actual time=0.015..331.990 rows=1747305 loops=1)
   ->  Materialize  (cost=8.28..8.29 rows=1 width=8) (actual 
time=0.013..0.015 rows=1 loops=1)
         ->  Sort  (cost=8.28..8.28 rows=1 width=8) (actual 
time=0.012..0.013 rows=1 loops=1)
               Sort Key: questions.user_id
               Sort Method:  quicksort  Memory: 25kB
               ->  Index Scan using questions_pkey on questions 
(cost=0.00..8.27 rows=1 width=8) (actual time=0.006..0.006 rows=1 loops=1)
                     Index Cond: (id = 1)
 Total runtime: 888.832 ms
(11 rows)

explain analyze
select questions.id
from questions
join (
  select u.id
  from users as u
) as s
on s.id = questions.user_id
where questions.id = 1;

-----------------------
 Nested Loop  (cost=0.00..16.77 rows=1 width=4) (actual time=0.019..0.021 
rows=1 loops=1)
   ->  Index Scan using questions_pkey on questions  (cost=0.00..8.27 
rows=1 width=8) (actual time=0.008..0.009 rows=1 loops=1)
         Index Cond: (id = 1)
   ->  Index Scan using users_pkey on users u  (cost=0.00..8.49 rows=1 
width=4) (actual time=0.007..0.007 rows=1 loops=1)
         Index Cond: (u.id = questions.user_id)
 Total runtime: 0.045 ms
(6 rows)

That said, there might be some room for an optimization which pushes
that test into the query with the "group by" clause.  I don't know
if there's a problem with that which I'm missing, the construct was
judged to be too rare to be worth the cost of testing for it, or
it's just that nobody has yet gotten to it.

Anyone have more insights on whether this is hard to optimize or simply 
not-yet-optimized?  And if the latter, where might I start looking? (Not 
that you -really- want me to submit a patch; my C has regressed to the "try 
an ampersand.  OK, try an asterisk." level...)

Jay

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance