On Fri, Jan 14, 2011 at 2:11 PM, Jon Nelson <jnelson+pgsql@xxxxxxxxxxx> wrote:
> On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> Jon Nelson <jnelson+pgsql@xxxxxxxxxxx> writes:
>>> On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>>>> If you have enough memory to de-dup them individually, you surely have
>>>> enough to de-dup all at once.
>>
>>> If everything were available up-front, sure.
>>> However, and please correct me if I'm wrong, but doesn't postgresql
>>> work in a fairly linear fashion, moving from table to table performing
>>> a series of operations on each?
>>
>> Doing a single sort+uniq works like that.  But the alternate plan you
>> are proposing we should consider involves building all the lower
>> hashtables, and then reading from them to fill the upper hashtable.
>> Max memory consumption *is* worst case here.  Remember HashAggregate
>> is incapable of swapping to disk (and if it did, you wouldn't be nearly
>> as pleased with its performance).
>
> That's not exactly what I'm proposing - but it is probably due to a
> lack of understanding of some of the underlying details of how
> postgresql works. I guess I had assumed that the result of a
> HashAggregate or any other de-duplication process was a table-like
> structure.

And I assumed wrong, I think. I dug into the code (nodeHash.c and others), and I think I now understand why HashAggregate works only in certain circumstances, and I understand your comments a bit better. Basically, HashAggregate doesn't stream unique Nodes the way nodeUnique.c does. nodeUnique emits Nodes and elides subsequent, identical Nodes, which is why it relies on its input being sorted. HashAggregate, by contrast, operates on the entire input set at once, and nodeHash.c doesn't really emit Nodes at all. This makes me wonder whether nodeHash.c and nodeHashjoin.c couldn't be modified to output Nodes in a streaming fashion. The memory requirements would be no worse than they are now.
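To restate the distinction I'm describing, here is a rough Python sketch of the two strategies (illustrative only; the actual executor code in nodeUnique.c and nodeAgg.c is C and considerably more involved):

```python
# Sketch of the two de-duplication strategies discussed above.
# Illustrative Python, not PostgreSQL's actual C implementation.

def unique_streaming(sorted_rows):
    """nodeUnique-style: emit each row, elide identical successors.
    Requires sorted input; uses O(1) memory and streams results."""
    prev = object()  # sentinel that compares unequal to any row
    for row in sorted_rows:
        if row != prev:
            yield row
            prev = row

def hash_aggregate(rows):
    """HashAggregate-style: absorb the entire input into an in-memory
    hash table before emitting anything. Memory grows with the number
    of distinct values, and (as noted above) it cannot spill to disk."""
    table = {}
    for row in rows:
        table[row] = None   # key presence is all that matters
    return list(table)      # results available only after all input is consumed
```

The first function can begin producing output after a single input row; the second cannot produce anything until the whole input has been read, which is why its peak memory use is unavoidable.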
Does postgresql support any sort of merge sort? If it did, then when the hashtable started consuming too much memory, it could be cleared, the nodes output from each successive hashtable could be directed to another temporary file, and a merge sort could then be performed across all of the temporary files (and thus Unique could be used to effect the UNION operation).

--
Jon

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
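The spill-and-merge idea proposed in the message above can be sketched as follows (a hypothetical Python illustration, not existing PostgreSQL behavior; the size threshold and in-memory "runs" standing in for temporary files are made up for the example):

```python
# Sketch of the proposal: when the in-memory hash table grows past a
# threshold, dump its contents as a sorted run, start a fresh table,
# then merge-sort all runs and apply a streaming Unique pass on top.
import heapq

def dedup_with_spilling(rows, max_table_size=4):
    runs = []     # each run: a sorted list standing in for a temp file
    table = set()
    for row in rows:
        table.add(row)
        if len(table) >= max_table_size:
            runs.append(sorted(table))   # "spill" the current hash table
            table = set()
    if table:
        runs.append(sorted(table))       # final partial run
    # Merge-sort all the spilled runs, then stream Unique over the result
    # (duplicates may still exist *across* runs, so this pass is needed).
    prev = object()
    for row in heapq.merge(*runs):
        if row != prev:
            yield row
            prev = row
```

Peak memory is bounded by the spill threshold regardless of input size, at the cost of writing and re-reading the temporary runs.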