Is there any reason why PostgreSQL evaluates subqueries / stored functions in the select list before applying ORDER BY / LIMIT?
I am talking about cases like:
SELECT *,
       (some very slow subquery, or a slow stable/immutable function such as XML processing)
FROM some_table
ORDER BY some_field   -- unrelated to the subquery results
LIMIT N
I have seen cases where this leads to a 3-6 orders of magnitude slowdown.
Simplest test case:
CREATE TABLE test (id integer);
INSERT INTO test SELECT * FROM generate_series(1,1000);
Slow query (note loops=1000 on the subplan):
EXPLAIN ANALYZE select id,(select count(*) from test t1 where t1.id=t.id) from test t order by id limit 10;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=13044.61..13044.63 rows=10 width=4) (actual time=158.636..158.641 rows=10 loops=1)
   ->  Sort  (cost=13044.61..13047.11 rows=1000 width=4) (actual time=158.636..158.639 rows=10 loops=1)
         Sort Key: t.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on test t  (cost=0.00..13023.00 rows=1000 width=4) (actual time=0.188..158.242 rows=1000 loops=1)
               SubPlan 1
                 ->  Aggregate  (cost=13.00..13.01 rows=1 width=0) (actual time=0.157..0.157 rows=1 loops=1000)
                       ->  Seq Scan on test t1  (cost=0.00..13.00 rows=1 width=0) (actual time=0.081..0.156 rows=1 loops=1000)
                             Filter: (id = t.id)
 Total runtime: 158.676 ms
Fast query:
EXPLAIN ANALYZE select id,(select count(*) from test t1 where t1.id=t.id) from (select id from test order by id limit 10) as t order by id;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=32.11..162.36 rows=10 width=4) (actual time=1.366..4.770 rows=10 loops=1)
   ->  Limit  (cost=32.11..32.13 rows=10 width=4) (actual time=0.971..0.983 rows=10 loops=1)
         ->  Sort  (cost=32.11..34.61 rows=1000 width=4) (actual time=0.970..0.975 rows=10 loops=1)
               Sort Key: test.id
               Sort Method: top-N heapsort  Memory: 25kB
               ->  Seq Scan on test  (cost=0.00..10.50 rows=1000 width=4) (actual time=0.027..0.455 rows=1000 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=13.00..13.01 rows=1 width=0) (actual time=0.375..0.375 rows=1 loops=10)
           ->  Seq Scan on test t1  (cost=0.00..13.00 rows=1 width=0) (actual time=0.017..0.371 rows=1 loops=10)
                 Filter: (id = t.id)
 Total runtime: 4.845 ms
Using the second form is a reasonable workaround, for sure, but half a year ago I came across a project where I had to ask the developers to rewrite a huge pile of analytical queries
that way to get acceptable performance (and there was a lot of outcry and complaining in the process).
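For reference, applied to the generic query at the top, the rewrite pattern looks roughly like this (just a sketch; some_table / some_field / N are the placeholders from above):

SELECT sub.*,
       (some very slow subquery or slow function)  -- now evaluated only for the N surviving rows
FROM (
    SELECT *
    FROM some_table
    ORDER BY some_field
    LIMIT N
) AS sub
ORDER BY some_field;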
And of course it is not always possible to create additional indexes so that the query goes through an index scan / backward index scan instead of a sort/limit at the top level.
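When an index on the sort column is an option, a sketch of what I mean (the index name is mine) applied to the test case:

CREATE INDEX test_id_idx ON test (id);  -- hypothetical index on the ORDER BY column
ANALYZE test;

-- If the planner now chooses an index scan to satisfy ORDER BY ... LIMIT,
-- it stops after 10 rows and the select-list subplan should run only 10 times.
EXPLAIN ANALYZE
SELECT id,
       (SELECT count(*) FROM test t1 WHERE t1.id = t.id)
FROM test t
ORDER BY id
LIMIT 10;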
Regards,
Maksym
--
Maxim Boguk
Senior Postgresql DBA.
Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678
Skype: maxim.boguk
Jabber: maxim.boguk@xxxxxxxxx
LinkedIn profile: http://nz.linkedin.com/in/maximboguk
If they can send one man to the moon... why can't they send them all?
МойКруг: http://mboguk.moikrug.ru/
Strength breaks straw, but not everything in our life is straw, and strength is far from everything.