Re: Subquery in a JOIN not getting restricted?

Gavin Flower <GavinFlower@xxxxxxxxxxxxxxxxx> · Sat, 12 Nov 2011 22:28:45 +1300

On 10/11/11 09:39, Jay Levitt wrote:
Kevin Grittner wrote:
Jay Levitt<jay.levitt@xxxxxxxxx>  wrote:

I don't get why the GROUP BY in this subquery forces it to scan
the entire users table (seq scan here, index scan on a larger
table) when there's only one row in users that can match:

Are you sure there's a plan significantly faster than 1.3 ms?

Yep!  Watch this:

drop schema if exists jaytest cascade;
create schema jaytest;
set search_path to jaytest;

create table questions (
  id int not null primary key,
  user_id int not null
);
insert into questions
  select generate_series(1,1100), (random()*2000000)::int;

create table users (
  id int not null primary key
);
insert into users select generate_series(1, 2000000);

vacuum freeze analyze;

explain analyze
select questions.id
from questions
join (
  select u.id
  from users as u
  group by u.id
) as s
on s.id = questions.user_id
where questions.id = 1;

-----------------------
 Merge Join  (cost=8.28..90833.02 rows=1818 width=4) (actual 
time=888.787..888.790 rows=1 loops=1)
   Merge Cond: (u.id = questions.user_id)
   ->  Group  (cost=0.00..65797.47 rows=2000000 width=4) (actual 
time=0.017..735.509 rows=1747305 loops=1)
         ->  Index Scan using users_pkey on users u  
(cost=0.00..60797.47 rows=2000000 width=4) (actual time=0.015..331.990 
rows=1747305 loops=1)
   ->  Materialize  (cost=8.28..8.29 rows=1 width=8) (actual 
time=0.013..0.015 rows=1 loops=1)
         ->  Sort  (cost=8.28..8.28 rows=1 width=8) (actual 
time=0.012..0.013 rows=1 loops=1)
               Sort Key: questions.user_id
               Sort Method:  quicksort  Memory: 25kB
               ->  Index Scan using questions_pkey on questions 
(cost=0.00..8.27 rows=1 width=8) (actual time=0.006..0.006 rows=1 
loops=1)
                     Index Cond: (id = 1)
 Total runtime: 888.832 ms
(11 rows)

explain analyze
select questions.id
from questions
join (
  select u.id
  from users as u
) as s
on s.id = questions.user_id
where questions.id = 1;

-----------------------
 Nested Loop  (cost=0.00..16.77 rows=1 width=4) (actual 
time=0.019..0.021 rows=1 loops=1)
   ->  Index Scan using questions_pkey on questions  (cost=0.00..8.27 
rows=1 width=8) (actual time=0.008..0.009 rows=1 loops=1)
         Index Cond: (id = 1)
   ->  Index Scan using users_pkey on users u  (cost=0.00..8.49 rows=1 
width=4) (actual time=0.007..0.007 rows=1 loops=1)
         Index Cond: (u.id = questions.user_id)
 Total runtime: 0.045 ms
(6 rows)

That said, there might be some room for an optimization which pushes
that test into the query with the "group by" clause.  I don't know
if there's a problem with that which I'm missing, the construct was
judged to be too rare to be worth the cost of testing for it, or
it's just that nobody has yet gotten to it.

Anyone have more insights on whether this is hard to optimize or 
simply not-yet-optimized?  And if the latter, where might I start 
looking? (Not that you -really- want me to submit a patch; my C has 
regressed to the "try an ampersand.  OK, try an asterisk." level...)

Jay

Minor note:

'PRIMARY KEY' gives you a 'NOT NULL'  constraint for free.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance