Is there any reason why PostgreSQL evaluates subqueries / stored functions in the select list before applying ORDER BY / LIMIT?
I am talking about cases like:
SELECT *,
       (some very slow subquery, or a slow stable/immutable function such as XML processing)
FROM some_table
ORDER BY some_field   -- unrelated to the subquery results
LIMIT N
I have seen cases where this leads to a 3-6 orders of magnitude slowdown.
Simplest test case:
CREATE TABLE test (id integer);
INSERT INTO test SELECT * FROM generate_series(1,1000);
Slow query (note loops=1000 on the subplan):
EXPLAIN ANALYZE select id,(select count(*) from test t1 where t1.id=t.id) from test t order by id limit 10;
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=13044.61..13044.63 rows=10 width=4) (actual time=158.636..158.641 rows=10 loops=1)
   ->  Sort  (cost=13044.61..13047.11 rows=1000 width=4) (actual time=158.636..158.639 rows=10 loops=1)
         Sort Key: t.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on test t  (cost=0.00..13023.00 rows=1000 width=4) (actual time=0.188..158.242 rows=1000 loops=1)
               SubPlan 1
                 ->  Aggregate  (cost=13.00..13.01 rows=1 width=0) (actual time=0.157..0.157 rows=1 loops=1000)
                       ->  Seq Scan on test t1  (cost=0.00..13.00 rows=1 width=0) (actual time=0.081..0.156 rows=1 loops=1000)
                             Filter: (id = t.id)
 Total runtime: 158.676 ms
Fast query:
EXPLAIN ANALYZE select id,(select count(*) from test t1 where t1.id=t.id) from (select id from test order by id limit 10) as t order by id;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=32.11..162.36 rows=10 width=4) (actual time=1.366..4.770 rows=10 loops=1)
   ->  Limit  (cost=32.11..32.13 rows=10 width=4) (actual time=0.971..0.983 rows=10 loops=1)
         ->  Sort  (cost=32.11..34.61 rows=1000 width=4) (actual time=0.970..0.975 rows=10 loops=1)
               Sort Key: test.id
               Sort Method: top-N heapsort  Memory: 25kB
               ->  Seq Scan on test  (cost=0.00..10.50 rows=1000 width=4) (actual time=0.027..0.455 rows=1000 loops=1)
   SubPlan 1
     ->  Aggregate  (cost=13.00..13.01 rows=1 width=0) (actual time=0.375..0.375 rows=1 loops=10)
           ->  Seq Scan on test t1  (cost=0.00..13.00 rows=1 width=0) (actual time=0.017..0.371 rows=1 loops=10)
                 Filter: (id = t.id)
 Total runtime: 4.845 ms
Using the second form is a reasonable workaround, for sure, but half a year ago I came across a project where I had to ask the developers to rewrite a huge pile of analytical queries
that way to get acceptable performance (and there was a lot of outcry and complaining in the process).
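For reference, applied to the generic query at the top, the rewrite pattern looks roughly like this (just a sketch; some_table / some_field / N are the placeholders from above):

SELECT sub.*,
       (some very slow subquery or slow function)  -- now evaluated only for the N surviving rows
FROM (
    SELECT *
    FROM some_table
    ORDER BY some_field
    LIMIT N
) AS sub
ORDER BY some_field;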
And of course it is not always possible to create additional indexes so that the query goes through an index scan / backward index scan instead of a sort/limit at the top level.
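When an index on the sort column is an option, a sketch of what I mean (the index name is mine) applied to the test case:

CREATE INDEX test_id_idx ON test (id);  -- hypothetical index on the ORDER BY column
ANALYZE test;

-- If the planner now chooses an index scan to satisfy ORDER BY ... LIMIT,
-- it stops after 10 rows and the select-list subplan should run only 10 times.
EXPLAIN ANALYZE
SELECT id,
       (SELECT count(*) FROM test t1 WHERE t1.id = t.id)
FROM test t
ORDER BY id
LIMIT 10;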
Regards,
Maksym
--
Maxim Boguk
Senior Postgresql DBA.
Phone RU: +7 910 405 4718
Phone AU: +61 45 218 5678
Skype: maxim.boguk
Jabber: maxim.boguk@xxxxxxxxx
LinkedIn profile: http://nz.linkedin.com/in/maximboguk
If they can send one man to the moon... why can't they send them all?
МойКруг: http://mboguk.moikrug.ru/
Strength breaks straw, but not everything in our life is straw, and strength is far from everything.