Re: Idea: PostgreSQL equivalent to Oracle's KEEP clause

Ben Clements <benhasgonewalking@xxxxxxxxx> · Mon, 13 Mar 2023 23:07:02 -0400

Thanks David.

Similar to your "TOP() and BOTTOM() aggregate" idea, you might find Erwin Brandstetter's solution using the LAST() aggregate function interesting: (https://dba.stackexchange.com/a/324646/100880)

If the FIRST_LAST_AGG extension is installed, then we can do something like this:
SELECT country
     , count(*) AS ct_cities
     , max(population) AS highest_population
     , last(city ORDER BY population, city) AS biggest_city  -- !
FROM   cities
GROUP  BY country
HAVING count(*) > 1;

-Ben

On Mon, Mar 6, 2023 at 9:51 PM David Rowley <dgrowleyml@xxxxxxxxx> wrote:
On Tue, 7 Mar 2023 at 12:40, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:

>

> Ben Clements <benhasgonewalking@xxxxxxxxx> writes:

> > As shown above, the following calculated column can bring in the city name,

> > even though the city name isn't in the GROUP BY:

> >    max(city) keep (dense_rank first order by population desc)

>

> You haven't really explained what this does, let alone why it can't

> be implemented with existing features such as FILTER and ORDER BY.

(It wasn't clear to me until I watched the youtube video.)

Likely KEEP is more flexible than just the given example but I think

that something similar to the example given could be done by inventing

a TOP() and BOTTOM() aggregate. Then you could write something like:

select

   country,

   count(*),

   max(population),

   bottom(city, population)

from

   cities

group by

   country

having

   count(*) > 1

the transfn for bottom() would need to remember the city and the

population for the highest yet seen value of the 2nd arg.  The

combinefn would need to find the aggregate state with the highest 2nd

arg value, the finalfn would just spit out the column that's stored in

the state.  Where this wouldn't work would be if multiple columns were

required to tiebreak the sort.

You could at least parallelize the aggregation this way. If there were

to be some form of ORDER BY in the aggregate then no parallelization

would be possible.  I'd assume since the whole thing can be done with

a subquery that the entire point of having special syntax for this

would be because we don't want to pay the price of looking at the

table twice, i.e. performance must matter, so the ability to have

parallel aggregates here seems good.

I can't quite think of a way to have parallel query and an arbitrarily

long list of columns to sort on...

For Ben, we do tend to shy away from copying other RDBMS's extensions

to the SQL language.  The problem is that copying these can cause

problems in the future if/when the standard adopts that syntax with

variations or invents something else that conflicts with the grammar

that we've added.  One example of something we didn't do was Oracle's

CONNECT BY.  Eventually, the SQL standard got WITH RECURSIVE to allow

queries on hierarchical data. Of course, we do have many of our own

extensions to the standard, so we certainly do make exceptions

sometimes. So, don't be too surprised that there's some discussion of

other methods which might make this work which don't involve copying

what someone else has done.

David