Jay Levitt <jay.levitt@xxxxxxxxx> wrote: > When I run the following query: > > select questions.id > from questions > join ( > select u.id as user_id > from users as u > left join scores as s > on s.user_id = u.id > ) as subquery > on subquery.user_id = questions.user_id; > > the subquery is scanning my entire user table, even though it's > restricted by the outer query. (My real subquery is much more > complicated, of course, but this is the minimal fail case.) It's not a fail case -- it's choosing the plan it thinks is cheapest based on your costing parameters and the statistics gathered by the latest ANALYZE of the data. > Is this just not a thing the optimizer can do? Are there ways to > rewrite this, still as a subquery, that will be smart enough to > only produce the one row of subquery that matches > questions.user_id? Well, it can certainly produce the plan you seem to want, if it looks less expensive. It kinda did with the following script: create table questions (id int not null primary key, user_id int not null); insert into questions select generate_series(1,100), (random()*1000000)::int; create table users (id int not null primary key); insert into users select generate_series(1, 1000000); create table scores (id int not null primary key, user_id int not null); insert into scores select n, n from (select generate_series(1,1000000)) x(n); vacuum freeze analyze; explain analyze select questions.id from questions join ( select u.id as user_id from users as u left join scores as s on s.user_id = u.id ) as subquery on subquery.user_id = questions.user_id; Here's the plan I got, which scans the questions and then uses the index to join to the users. It's throwing the result of that into a hash table which is then checked from a sequential scan of the scores table. If I had made the scores table wider, it might have gone from the user table to scores on the index. Hash Right Join (cost=438.23..18614.23 rows=100 width=4) (actual time=2.776..161.237 rows=100 loops=1) Hash Cond: (s.user_id = u.id) -> Seq Scan on scores s (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.025..77.876 rows=1000000 loops=1) -> Hash (cost=436.98..436.98 rows=100 width=8) (actual time=0.752..0.752 rows=100 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 4kB -> Nested Loop (cost=0.00..436.98 rows=100 width=8) (actual time=0.032..0.675 rows=100 loops=1) -> Seq Scan on questions (cost=0.00..2.00 rows=100 width=8) (actual time=0.010..0.042 rows=100 loops=1) -> Index Only Scan using users_pkey on users u (cost=0.00..4.34 rows=1 width=4) (actual time=0.005..0.005 rows=1 loops=100) Index Cond: (id = questions.user_id) Total runtime: 168.585 ms If you want help figuring out whether it is choosing the fastest plan, and how to get it do better if it is not, please read this page and post the relevant information: http://wiki.postgresql.org/wiki/SlowQueryQuestions -Kevin -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance