Re: ARRAY_AGG(DISTINCT a ORDER BY b) Error

"David Johnston" <polobo@xxxxxxxxx> · Wed, 31 Aug 2011 19:40:14 -0400

-----Original Message-----
From: Tom Lane [mailto:tgl@xxxxxxxxxxxxx] 
Sent: Wednesday, August 31, 2011 7:10 PM
To: David Johnston
Cc: pgsql-general@xxxxxxxxxxxxxx
Subject: Re:  ARRAY_AGG(DISTINCT a ORDER BY b) Error 

"David Johnston" <polobo@xxxxxxxxx> writes:
> "in an aggregate with DISTINCT, ORDER BY expressions must appear in 
> argument list"

> Why?

Because the results are ill-defined otherwise.  In your example,

>	... ARRAY_AGG(DISTINCT accountnumber ORDER BY amount DESC) ...

there may be many rows with the same "accountnumber" and yet different
"amount" values.  Which of those amounts should be used as the sort key for
the aggregated row?

			regards, tom lane

----------------------------------

My take is that you look at the non-aggregated data using the same ORDER BY
clause and then add the values in the order they appear in the detail.

So (account, amount, [Action]):

1000, 150, [Add]
1000, 130, [Skip]
2000, 120, [Add]
2000, 100, [Skip]
1000, 50, [Skip]
2000, 0, [Skip]
3000, -10, [Add]
1000, -50, [Skip]

This is basically how a plain DISTINCT would have to work (taking the first
encountered value and skipping any subsequent repeats).  I want the same
general behavior but have the opportunity to order the input "frame" so that
I can manipulate the encounter order.

In a 2-phase process you would do the following (I will remain specific for
the time being):

Phase 1: Select the representative record for each DISTINCT (accountnumber);
The representative would be the FIRST record as determined via an ORDER BY
of all candidate records.
Phase 2: Order the DISTINCT representatives based upon the same ORDER BY
clause

So:
Phase 1: (3000, -10), (1000, 150), (2000, 120) [Not Ordered] 
Phase 2: (1000, 150), (2000, 120), (3000, -10) [ORDER BY amount DESC]

The behavior/concept of [ARRAY_AGG(DISTINCT ON (accountnumber) .?. ORDER BY
amount] matches here but I do not want to output an amount column at all.

The other view is that you can create the ARRAY using just the ORDER BY and
then immediately "DISTINCT"ify the array so that there are no duplicates.
This is basically the 2-phase process described above.  In this view you
basically keep the DISTINCT value that has the lowest array index.

I would be interested in other possible interpretations/algorithms that
would then cause ambiguity in deciding which algorithm to implement.

I know that I am only considering "ARRAY_AGG" in my examples but my first
reaction is that other aggregates would behave acceptably under the
algorithm described; and since the current behavior is FAIL at least some
well-defined behavior would exist for the specified syntax.

David J.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general