Re: Extracting superlatives - SQL design philosophy

Dave Crooke <dcrooke@xxxxxxxxx> · Wed, 24 Feb 2010 16:47:54 -0600

Garrett's is the best answer from the list ... the only fly in the
ointment here is that it performs a full sort of the records, which
isn't strictly necessary to produce the required output. This is functionally
equivalent to what I came up with for a MODE (most common value)
aggregation, but the syntax is a bit neater.

Craig / George - there are lots of ways to do this with a subquery or
join back against the bar table. My goal was actually to avoid this
dual lookup and join, not for this contrived example but for my
real-world case where the "bar" table is not actually a table but is a
record set generated on the fly with a non-trivial subquery which I
don't want to repeat.

Mose - I think I get what you're aiming at here, but your query as
stated returns a syntax error.

What I'd like to have is a way in SQL to get an execution plan which
matches the Java algorithm below, which just does one "table scan" down
the data and aggregates in place, i.e. it's O(n), with one hash lookup
per row:

    // Needs java.util.HashMap; Row is any type with city and temp fields.
    HashMap<String,Row> winners = new HashMap<String,Row>();

    // Single pass: keep the hottest row seen so far for each city.
    for (Row r : rows) {
        Row oldrow = winners.get(r.city);
        if (oldrow == null || r.temp > oldrow.temp) winners.put(r.city, r);
    }
    for (Row winner : winners.values()) System.out.println(winner);
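
A self-contained, runnable version of the loop above might look like the
following sketch. The Row class, the class name Superlatives and the
method names are assumptions made for illustration only; the sample rows
are the ones from the original post quoted below. On ties in temp, this
keeps whichever row was seen first.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Superlatives {
    // Hypothetical row type mirroring the columns of the "bar" table.
    static final class Row {
        final String city;
        final int temp;
        final String date;

        Row(String city, int temp, String date) {
            this.city = city;
            this.temp = temp;
            this.date = date;
        }

        @Override
        public String toString() {
            return city + " | " + temp + " | " + date;
        }
    }

    // The five sample rows from the original post.
    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row("Austin", 75, "2010-02-21 15:00:00"),
            new Row("Austin", 35, "2010-02-23 15:00:00"),
            new Row("Edinburgh", 42, "2010-02-23 15:00:00"),
            new Row("New York", 56, "2010-02-23 15:00:00"),
            new Row("New York", 78, "2010-06-23 15:00:00"));
    }

    // Single pass over the input: one hash lookup per row, no sort.
    static Map<String, Row> hottestPerCity(Iterable<Row> rows) {
        Map<String, Row> winners = new HashMap<>();
        for (Row r : rows) {
            Row old = winners.get(r.city);
            if (old == null || r.temp > old.temp) {
                winners.put(r.city, r);
            }
        }
        return winners;
    }

    public static void main(String[] args) {
        // HashMap iteration order is unspecified, so cities print in no particular order.
        for (Row winner : hottestPerCity(sampleRows()).values()) {
            System.out.println(winner);
        }
    }
}
```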

I'd imagine it would be possible to have a query planner optimization
that would convert Garrett's DISTINCT ON syntax to do what I was
trying to do, by realizing that DISTINCT ON X ... ORDER BY Y DESC is
going to return the one row for each X which has the highest value
of Y, and so use a MAX-structured accumulation instead of a sort.
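
For comparison, the sort-based route amounts to roughly the following:
sort by (city, temp DESC), then keep the first row of each city. This is
only a sketch of the semantics, not a description of how the PostgreSQL
executor is implemented; the class DistinctOnSketch, its Row type and the
method names are hypothetical. The sort makes it O(n log n), which is the
cost a MAX-style accumulation would avoid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DistinctOnSketch {
    // Hypothetical row type mirroring the columns of the "bar" table.
    static final class Row {
        final String city;
        final int temp;
        final String date;

        Row(String city, int temp, String date) {
            this.city = city;
            this.temp = temp;
            this.date = date;
        }

        @Override
        public String toString() {
            return city + " | " + temp + " | " + date;
        }
    }

    // The five sample rows from the original post.
    static List<Row> sampleRows() {
        return Arrays.asList(
            new Row("Austin", 75, "2010-02-21 15:00:00"),
            new Row("Austin", 35, "2010-02-23 15:00:00"),
            new Row("Edinburgh", 42, "2010-02-23 15:00:00"),
            new Row("New York", 56, "2010-02-23 15:00:00"),
            new Row("New York", 78, "2010-06-23 15:00:00"));
    }

    // Emulates: SELECT DISTINCT ON (city) * FROM bar ORDER BY city, temp DESC
    static List<Row> distinctOnCity(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        // Full sort: city ascending, then temp descending.
        sorted.sort((a, b) -> {
            int byCity = a.city.compareTo(b.city);
            return byCity != 0 ? byCity : Integer.compare(b.temp, a.temp);
        });
        // Keep only the first row of each run of equal cities.
        List<Row> out = new ArrayList<>();
        String lastCity = null;
        for (Row r : sorted) {
            if (!r.city.equals(lastCity)) {
                out.add(r);
                lastCity = r.city;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (Row r : distinctOnCity(sampleRows())) {
            System.out.println(r);
        }
    }
}
```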

Cheers
Dave

On Wed, Feb 24, 2010 at 3:43 PM, Garrett Murphy <gmurphy@xxxxxxxxxxxx> wrote:
> This looks to be a perfect use for SELECT DISTINCT ON:
>
> SELECT DISTINCT ON (city)
> * FROM bar
> ORDER BY city, temp desc
>
> Or am I misunderstanding the issue?
>
> Garrett Murphy
>
> -----Original Message-----
> From: pgsql-performance-owner@xxxxxxxxxxxxxx [mailto:pgsql-performance-owner@xxxxxxxxxxxxxx] On Behalf Of Dave Crooke
> Sent: Wednesday, February 24, 2010 2:31 PM
> To: pgsql-performance
> Subject:  Extracting superlatives - SQL design philosophy
>
> This is a generic SQL issue and not PG specific, but I'd like to get
> an opinion from this list.
>
> Consider the following data:
>
> # \d bar
>                 Table "public.bar"
>  Column |            Type             | Modifiers
> --------+-----------------------------+-----------
>  city   | character varying(255)      |
>  temp   | integer                     |
>  date   | timestamp without time zone |
>
> # select * from bar order by city, date;
>   city    | temp |        date
> -----------+------+---------------------
>  Austin    |   75 | 2010-02-21 15:00:00
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   56 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (5 rows)
>
> If you want the highest recorded temperature for a city, that's easy
> to do, since the selection criteria works on the same column that we
> are extracting:
>
> # select city, max(temp) from bar group by city order by 1;
>   city    | max
> -----------+-----
>  Austin    |  75
>  Edinburgh |  42
>  New York  |  78
> (3 rows)
>
>
> However there is (AFAIK) no simple way in plain SQL to write a query
> that performs such an aggregation where the aggregation criteria is on
> one column and you want to return another, e.g. adding the *date
> of* that highest temperature to the output above, or doing a query to
> get the most recent temperature reading for each city.
>
> What I'd like to do is something like the below (and I'm inventing
> mock syntax here, the following is not valid SQL):
>
> -- Ugly implicit syntax but no worse than an Oracle outer join ;-)
> select city, temp, date from bar where date=max(date) group by city,
> temp order by city;
>
> or perhaps
>
> -- More explicit
> select aggregate_using(max(date), city, temp, date) from bar group by
> city, temp order by city;
>
> Both of the above, if they existed, would be a single data access
> followed by a sort-merge.
>
> The only way I know how to do it involves doing two accesses to the data, e.g.
>
> # select city, temp, date from bar a where date=(select max(b.date)
> from bar b where a.city=b.city) order by 1;
>   city    | temp |        date
> -----------+------+---------------------
>  Austin    |   35 | 2010-02-23 15:00:00
>  Edinburgh |   42 | 2010-02-23 15:00:00
>  New York  |   78 | 2010-06-23 15:00:00
> (3 rows)
>
>
> # explain select * from bar a where date=(select max(b.date) from bar
> b where a.city=b.city) order by 1;
>                                QUERY PLAN
> --------------------------------------------------------------------------
>  Sort  (cost=1658.86..1658.87 rows=1 width=528)
>   Sort Key: a.city
>   ->  Seq Scan on bar a  (cost=0.00..1658.85 rows=1 width=528)
>         Filter: (date = (subplan))
>         SubPlan
>           ->  Aggregate  (cost=11.76..11.77 rows=1 width=8)
>                 ->  Seq Scan on bar b  (cost=0.00..11.75 rows=1 width=8)     -- would be an index lookup in a real scenario
>                       Filter: (($0)::text = (city)::text)
> (8 rows)
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>
