Re: Extracting superlatives - SQL design philosophy

"Pierre C" <lists@xxxxxxxxxx> · Thu, 25 Feb 2010 09:35:27 +0100

-- More explicit
select aggregate_using(max(date), city, temp, date) from bar group by
city, temp order by city;

select city, max(ROW(temp, date)) from bar group by city;

Does not work (alas) for lack of a default comparison for record type.

Another solution, which works wonders if you've got the list of cities in  
a separate table, and an index on (city, temp) is this :

SELECT c.city, (SELECT ROW( t.date, t.temp ) FROM cities_temp t WHERE  
t.city=c.city ORDER BY temp DESC LIMIT 1) FROM cities;

This will do a nested loop index scan and it is the fastest way, except if  
you have very few rows per city.
The syntax is ugly and you have to extract the stuff from the ROW()  
afterwards, though.

Unfortunately, this does not work :

SELECT c.city, (SELECT t.date, t.temp FROM cities_temp t WHERE  
t.city=c.city ORDER BY temp DESC LIMIT 1) AS m FROM cities;

because the subselect isn't allowed to return more than 1 column.

Note that you can also get the usually annoying top-N by category to use  
the index by doing something like :

SELECT c.city, (SELECT array_agg(date) FROM (SELECT t.date FROM  
cities_temp t WHERE t.city=c.city ORDER BY temp DESC LIMIT 5)) AS m FROM  
cities;

The results aren't in a very usable form either, but :

CREATE INDEX ti ON annonces( type_id, price ) WHERE price IS NOT NULL;

EXPLAIN ANALYZE SELECT
t.id, (SELECT ROW(a.id, a.price, a.date_annonce)
	FROM annonces a
	WHERE a.type_id = t.id AND price IS NOT NULL
	ORDER BY price DESC LIMIT 1)
FROM types_bien t;
                                                                  QUERY  
PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on types_bien t  (cost=0.00..196.09 rows=57 width=4) (actual  
time=0.025..0.511 rows=57 loops=1)
   SubPlan 1
     ->  Limit  (cost=0.00..3.41 rows=1 width=16) (actual  
time=0.008..0.008 rows=1 loops=57)
           ->  Index Scan Backward using ti on annonces a   
(cost=0.00..8845.65 rows=2592 width=16) (actual time=0.007..0.007 rows=1  
loops=57)
                 Index Cond: (type_id = $0)
 Total runtime: 0.551 ms

explain analyze
select distinct type_id, first_value(price) over w as max_price
from annonces where price is not null
window w as (partition by type_id order by price desc);
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=30515.41..30626.87 rows=11146 width=10) (actual  
time=320.927..320.971 rows=46 loops=1)
   ->  WindowAgg  (cost=27729.14..29958.16 rows=111451 width=10) (actual  
time=195.289..282.150 rows=111289 loops=1)
         ->  Sort  (cost=27729.14..28007.76 rows=111451 width=10) (actual  
time=195.278..210.762 rows=111289 loops=1)
               Sort Key: type_id, price
               Sort Method:  quicksort  Memory: 8289kB
               ->  Seq Scan on annonces  (cost=0.00..18386.17 rows=111451  
width=10) (actual time=0.009..72.589 rows=111289 loops=1)
                     Filter: (price IS NOT NULL)
 Total runtime: 322.382 ms

Here using the index is 600x faster... worth a bit of ugly SQL, you decide.

By disabling seq_scan and bitmapscan, you can corecr this plan :

EXPLAIN ANALYZE SELECT DISTINCT ON (type_id) type_id, date_annonce, price  
FROM annonces WHERE price IS NOT NULL ORDER BY type_id, price LIMIT 40;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..78757.61 rows=33 width=14) (actual time=0.021..145.509  
rows=40 loops=1)
   ->  Unique  (cost=0.00..78757.61 rows=33 width=14) (actual  
time=0.021..145.498 rows=40 loops=1)
         ->  Index Scan using ti on annonces  (cost=0.00..78478.99  
rows=111451 width=14) (actual time=0.018..132.671 rows=110796 loops=1)
 Total runtime: 145.549 ms

This plan would be very bad (unless the whole table is in RAM) because I  
guess the index scan isn't aware of the DISTINCT ON, so it scans all rows  
in the index and in the table.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance