Re: Merging large volumes of data

Tom Lane <tgl@xxxxxxxxxxxxx> · Mon, 07 May 2007 11:09:28 -0400

Andreas Kostyrka <andreas@xxxxxxxxxxxx> writes:
> Ambrus Wagner (IJ/ETH) wrote:
>> Is there a way to prevent PostgreSQL from doing a full sort on the result set after the unions have been completed? Even if I write
>> 
>> (select a,b,c,d,e from table1 order by a,b) union all
>> (select a,b,c,d,e from table2 order by a,b)  union all
>> etc...
>> (select a,b,c,d,e from tablen order by a,b)  order by a,b;
>> 
>> PostgreSQL does not seem to realise (maybe it should not be able to do this trick anyway) that the last "order by" clause is merely a final merge step on the ordered data sets.

At least to a first-order approximation, teaching it this would be a
waste of time.

Assume for simplicity that each of your K sub-selects yields N tuples.
Then there are KN items altogether, so if we just sort the big data set
it takes O(KN*log(KN)) time ... which is the same as O(KN*(log K + log N)).
OTOH, if we sort each sub-select by itself, that takes O(N*log N) time,
or O(KN*log N) for all K sub-sorts.  Now we've got to do a K-way merge,
which will take O(log K) comparisons for each of the KN tuples, ie,
O(KN*log K)).  Net result: exactly the same runtime.

Of course this argument fails if you have some way of obtaining the
sub-select values pre-sorted for free.  But it's never really free.
Historical experience is that full-table indexscans often underperform
explicit sorts, at least when there are enough tuples involved to
make the problem interesting.

So the bottom line is that the use-case for this optimization seems
far too narrow to justify the implementation effort.

			regards, tom lane