Re: Slow query with backwards index scan

Tilmann Singer <tils-pgsql@xxxxxxxx> · Sat, 28 Jul 2007 22:54:50 +0200

* Craig James <craig_james@xxxxxxxxxxxxxx> [20070728 22:00]:
> >>SELECT * FROM (
> >> (SELECT * FROM large_table lt
> >> WHERE lt.user_id = 12345
> >> ORDER BY created_at DESC LIMIT 10) AS q1
> >> UNION
> >> (SELECT * FROM large_table lt
> >> WHERE user_id IN (SELECT contact_id FROM relationships WHERE 
> >> user_id=12345)
> >> ORDER BY created_at DESC LIMIT 10) AS q2
> >>ORDER BY created_at DESC LIMIT 10;
> >
> >It's not possible to use ORDER BY or LIMIT within unioned queries.
> >
> >http://www.postgresql.org/docs/8.2/static/sql-select.html#SQL-UNION
> 
> If I'm reading this documentation correctly, it *is* possible, as long as 
> they're inside of a sub-select, as in this case.

I completely overlooked that obvious note in the documentation,
sorry. I tried it only with the aliases which fooled me into thinking
that doesn't work at all:

testdb=# (select 1 limit 1) as q1 union (select 2) as q2;
ERROR:  syntax error at or near "as"
LINE 1: (select 1 limit 1) as q1 union (select 2) as q2;
                           ^
but this works:

testdb=# (select 1 limit 1) union (select 2);
 ?column?
----------
        1
        2

Great - that works!

What I didn't realize in the original post is that the problem
actually seems to be how to retrieve the rows from large_table for the
correlated relationships in an efficient way - the second of the two
queries that could be UNIONed.

Using a subselect is efficient for the user with few relationships and
matched rows at the end of the sorted large_table:

testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt
 WHERE user_id IN (SELECT contact_id FROM relationships WHERE
 user_id=12345)
 ORDER BY created_at DESC LIMIT 10;
                                                                                    QUERY PLAN                                                            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6963.94..6963.96 rows=10 width=621) (actual time=94.598..94.629 rows=4 loops=1)
   ->  Sort  (cost=6963.94..6966.96 rows=1211 width=621) (actual time=94.592..94.602 rows=4 loops=1)
         Sort Key: lt.created_at
         ->  Nested Loop  (cost=39.52..6901.92 rows=1211 width=621) (actual time=85.670..94.547 rows=4 loops=1)
               ->  HashAggregate  (cost=39.52..39.53 rows=1 width=4) (actual time=23.549..23.552 rows=1 loops=1)
                     ->  Bitmap Heap Scan on relationships  (cost=4.33..39.49 rows=10 width=4) (actual time=23.526..23.530 rows=1 loops=1)
                           Recheck Cond: (user_id = 12345)
                           ->  Bitmap Index Scan on relationships_user_id_contact_id_index  (cost=0.00..4.33 rows=10 width=0) (actual time=0.027..0.027 rows=1 loops=1)
                                 Index Cond: (user_id = 12345)
               ->  Index Scan using large_user_id_started_at_index on large_table lt  (cost=0.00..6834.04 rows=2268 width=621) (actual time=62.108..70.952 rows=4 loops=1)
                     Index Cond: (lt.user_id = relationships.contact_id)
 Total runtime: 94.875 ms

But the subselect is not fast for the user with many relationships and
matched rows at the beginning of the sorted large_table:

testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt
 WHERE user_id IN (SELECT contact_id FROM relationships WHERE
 user_id=55555)
 ORDER BY created_at DESC LIMIT 10;
                                                                                       QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6963.94..6963.96 rows=10 width=621) (actual time=53187.349..53187.424 rows=10 loops=1)
   ->  Sort  (cost=6963.94..6966.96 rows=1211 width=621) (actual time=53187.341..53187.360 rows=10 loops=1)
         Sort Key: lt.created_at
         ->  Nested Loop  (cost=39.52..6901.92 rows=1211 width=621) (actual time=201.728..52673.800 rows=69018 loops=1)
               ->  HashAggregate  (cost=39.52..39.53 rows=1 width=4) (actual time=178.777..178.966 rows=40 loops=1)
                     ->  Bitmap Heap Scan on relationships  (cost=4.33..39.49 rows=10 width=4) (actual time=47.049..178.560 rows=40 loops=1)
                           Recheck Cond: (user_id = 55555)
                           ->  Bitmap Index Scan on relationships_user_id_contact_id_index  (cost=0.00..4.33 rows=10 width=0) (actual time=28.721..28.721 rows=40 loops=1)
                                 Index Cond: (user_id = 55555)
               ->  Index Scan using large_user_id_started_at_index on large_table lt  (cost=0.00..6834.04 rows=2268 width=621) (actual time=21.994..1301.375 rows=1725 loops=40)
                     Index Cond: (lt.user_id = relationships.contact_id)
 Total runtime: 53188.584 ms

Using a join now the query for matches for the user with little data is slow:

testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt
 JOIN relationships r ON lt.user_id=r.contact_id WHERE
 r.user_id=12345
 ORDER BY lt.created_at DESC LIMIT 10;
                                                                                         QUERY PLAN                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..24116.65 rows=10 width=645) (actual time=100348.436..145552.633 rows=4 loops=1)
   ->  Nested Loop  (cost=0.00..28751864.52 rows=11922 width=645) (actual time=100348.429..145552.602 rows=4 loops=1)
         ->  Index Scan Backward using large_created_at_index on large_table lt  (cost=0.00..824833.09 rows=4384343 width=621) (actual time=28.961..82448.167 rows=4384064 loops=1)
         ->  Index Scan using relationships_user_id_contact_id_index on relationships r  (cost=0.00..6.36 rows=1 width=24) (actual time=0.009..0.009 rows=0 loops=4384064)
               Index Cond: ((r.user_id = 12345) AND (lt.user_id = r.contact_id))
 Total runtime: 145552.809 ms

And for the user with much data it's fast:

testdb=# EXPLAIN ANALYZE SELECT * FROM large_table lt
 JOIN relationships r ON lt.user_id=r.contact_id WHERE
 r.user_id=55555
 ORDER BY lt.created_at DESC LIMIT 10;
                                                                                    QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..24116.65 rows=10 width=645) (actual time=0.068..0.428 rows=10 loops=1)
   ->  Nested Loop  (cost=0.00..28751864.52 rows=11922 width=645) (actual time=0.063..0.376 rows=10 loops=1)
         ->  Index Scan Backward using large_created_at_index on large_table lt  (cost=0.00..824833.09 rows=4384343 width=621) (actual time=0.028..0.064 rows=13 loops=1)
         ->  Index Scan using relationships_user_id_contact_id_index on relationships r  (cost=0.00..6.36 rows=1 width=24) (actual time=0.010..0.013 rows=1 loops=13)
               Index Cond: ((r.user_id = 55555) AND (lt.user_id = r.contact_id))
 Total runtime: 0.609 ms

Any ideas?

tia, Til

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org