* Nis Jørgensen <nis@xxxxxxxxxxxxx> [20070727 20:31]: > How does the "obvious" UNION query do - ie: > > SELECT * FROM ( > SELECT * FROM large_table lt > WHERE lt.user_id = 12345 > > UNION > > SELECT * FROM large_table lt > WHERE user_id IN (SELECT contact_id FROM relationships WHERE user_id=12345) > ) q > > ORDER BY created_at DESC LIMIT 10; Great for the user with little data: testdb=# EXPLAIN ANALYZE SELECT * FROM ( SELECT * FROM large_table lt WHERE lt.user_id = 12345 UNION SELECT * FROM large_table lt WHERE user_id IN (SELECT contact_id FROM relationships WHERE user_id=12345) ) q ORDER BY created_at DESC LIMIT 10; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=14673.77..14673.80 rows=10 width=3140) (actual time=133.877..133.946 rows=10 loops=1) -> Sort (cost=14673.77..14682.22 rows=3378 width=3140) (actual time=133.870..133.894 rows=10 loops=1) Sort Key: q.created_at -> Unique (cost=14315.34..14442.01 rows=3378 width=622) (actual time=133.344..133.705 rows=38 loops=1) -> Sort (cost=14315.34..14323.78 rows=3378 width=622) (actual time=133.337..133.432 rows=38 loops=1) Sort Key: id, user_id, plaze_id, device, started_at, updated_at, status, "type", duration, permission, created_at, mac_address, subnet, msc -> Append (cost=0.00..14117.35 rows=3378 width=622) (actual time=39.144..133.143 rows=38 loops=1) -> Index Scan using large_user_id_started_at_index on large_table lt (cost=0.00..7243.59 rows=2158 width=622) (actual time=39.138..109.831 rows=34 loops=1) Index Cond: (user_id = 12345) -> Nested Loop (cost=42.78..6839.98 rows=1220 width=622) (actual time=14.859..23.112 rows=4 loops=1) -> HashAggregate (cost=42.78..42.79 rows=1 width=4) (actual time=8.092..8.095 rows=1 loops=1) -> Bitmap Heap Scan on relationships (cost=4.34..42.75 rows=11 width=4) (actual time=8.067..8.070 rows=1 loops=1) Recheck Cond: (user_id = 12345) -> Bitmap Index Scan on relationships_user_id_index (cost=0.00..4.34 rows=11 width=0) (actual time=8.057..8.057 rows=1 loops=1) Index Cond: (user_id = 12345) -> Index Scan using large_user_id_started_at_index on large_table lt (cost=0.00..6768.48 rows=2297 width=622) (actual time=6.751..14.970 rows=4 loops=1) Index Cond: (lt.user_id = relationships.contact_id) Total runtime: 134.220 ms Not so great for the user with many early matches: testdb=# EXPLAIN ANALYZE SELECT * FROM ( SELECT * FROM large_table lt WHERE lt.user_id = 55555 UNION SELECT * FROM large_table lt WHERE user_id IN (SELECT contact_id FROM relationships WHERE user_id=55555) ) q ORDER BY created_at DESC LIMIT 10; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=14673.77..14673.80 rows=10 width=3140) (actual time=3326.304..3326.367 rows=10 loops=1) -> Sort (cost=14673.77..14682.22 rows=3378 width=3140) (actual time=3326.297..3326.318 rows=10 loops=1) Sort Key: q.created_at -> Unique (cost=14315.34..14442.01 rows=3378 width=622) (actual time=2413.070..3019.385 rows=69757 loops=1) -> Sort (cost=14315.34..14323.78 rows=3378 width=622) (actual time=2413.062..2590.354 rows=69757 loops=1) Sort Key: id, user_id, plaze_id, device, started_at, updated_at, status, "type", duration, permission, created_at, mac_address, subnet, msc -> Append (cost=0.00..14117.35 rows=3378 width=622) (actual time=0.067..1911.626 rows=69757 loops=1) -> Index Scan using large_user_id_started_at_index on large_table lt (cost=0.00..7243.59 rows=2158 width=622) (actual time=0.062..3.440 rows=739 loops=1) Index Cond: (user_id = 55555) -> Nested Loop (cost=42.78..6839.98 rows=1220 width=622) (actual time=0.451..1557.901 rows=69018 loops=1) -> HashAggregate (cost=42.78..42.79 rows=1 width=4) (actual time=0.404..0.580 rows=40 loops=1) -> Bitmap Heap Scan on relationships (cost=4.34..42.75 rows=11 width=4) (actual time=0.075..0.241 rows=40 loops=1) Recheck Cond: (user_id = 55555) -> Bitmap Index Scan on relationships_user_id_index (cost=0.00..4.34 rows=11 width=0) (actual time=0.053..0.053 rows=40 loops=1) Index Cond: (user_id = 55555) -> Index Scan using large_user_id_started_at_index on large_table lt (cost=0.00..6768.48 rows=2297 width=622) (actual time=0.048..28.033 rows=1725 loops=40) Index Cond: (lt.user_id = relationships.contact_id) Total runtime: 3327.744 ms > How about > > SELECT * FROM large_table lt > WHERE lt.user_id = 12345 OR user_id IN (SELECT contact_id FROM > relationships WHERE user_id=12345) > ORDER BY created_at DESC LIMIT 10; Not good for the one with few matches: testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt WHERE lt.user_id = 12345 OR user_id IN (SELECT contact_id FROM relationships WHERE user_id=12345) ORDER BY created_at DESC LIMIT 10; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=42.78..47.05 rows=10 width=622) (actual time=38360.090..62008.336 rows=10 loops=1) -> Index Scan Backward using large_created_at_index on large_table lt (cost=42.78..935924.84 rows=2192498 width=622) (actual time=38360.084..62008.269 rows=10 loops=1) Filter: ((user_id = 12345) OR (hashed subplan)) SubPlan -> Bitmap Heap Scan on relationships (cost=4.34..42.75 rows=11 width=4) (actual time=0.031..0.034 rows=1 loops=1) Recheck Cond: (user_id = 12345) -> Bitmap Index Scan on relationships_user_id_index (cost=0.00..4.34 rows=11 width=0) (actual time=0.020..0.020 rows=1 loops=1) Index Cond: (user_id = 12345) Total runtime: 62008.500 ms Good for the one with many early matches: testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt WHERE lt.user_id = 55555 OR user_id IN (SELECT contact_id FROM relationships WHERE user_id=55555) ORDER BY created_at DESC LIMIT 10; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Limit (cost=42.78..47.05 rows=10 width=622) (actual time=0.473..0.572 rows=10 loops=1) -> Index Scan Backward using large_created_at_index on large_table lt (cost=42.78..935924.84 rows=2192498 width=622) (actual time=0.467..0.512 rows=10 loops=1) Filter: ((user_id = 55555) OR (hashed subplan)) SubPlan -> Bitmap Heap Scan on relationships (cost=4.34..42.75 rows=11 width=4) (actual time=0.070..0.264 rows=40 loops=1) Recheck Cond: (user_id = 55555) -> Bitmap Index Scan on relationships_user_id_index (cost=0.00..4.34 rows=11 width=0) (actual time=0.047..0.047 rows=40 loops=1) Index Cond: (user_id = 55555) Total runtime: 0.710 ms > I am missing a unique constraint on (user_id, contact_id) - otherwise > the subselect is not equivalent to the join. > > It might be good to know whether contact_id = user_id is possible - > since this would rule out the possibility of a row satisfying both > branches of the union. Thanks for the hint - this is a rails project with the validations starting out in the application only, so sometimes we forget to also check the data integrity in the database. There were in fact a couple of duplicate user_id/contact_id pairs and a couple of rows where user_id=contact_id although those shouldn't be allowed. I added a unique index on (user_id, contact_id) and a check constraint to prevent (user_id=contact_id), vacuumed, and ran the 4 above query variants again - but it did not result in changed plans or substantial differences in execution time. > Probably you also should have foreign key constraints on > relationships.user_id and relationships.contact_id. These are unlikely > to affect performance though, in my experience. They are there, I just removed them from the schema output before posting, also assuming that they are not relevant for join performance. Til ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate