On Fri, Jan 14, 2011 at 2:11 PM, Jon Nelson <jnelson+pgsql@xxxxxxxxxxx> wrote:
> On Thu, Jan 13, 2011 at 6:10 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>> Jon Nelson <jnelson+pgsql@xxxxxxxxxxx> writes:
>>> On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>>>> If you have enough memory to de-dup them individually, you surely have
>>>> enough to de-dup all at once.
>>
>>> If everything were available up-front, sure.
>>> However, and please correct me if I'm wrong, but doesn't postgresql
>>> work in a fairly linear fashion, moving from table to table performing
>>> a series of operations on each?
>>
>> Doing a single sort+uniq works like that.  But the alternate plan you
>> are proposing we should consider involves building all the lower
>> hashtables, and then reading from them to fill the upper hashtable.
>> Max memory consumption *is* worst case here.  Remember HashAggregate
>> is incapable of swapping to disk (and if it did, you wouldn't be nearly
>> as pleased with its performance).
>
> That's not exactly what I'm proposing - but it is probably due to a
> lack of understanding of some of the underlying details of how
> postgresql works. I guess I had assumed that the result of a
> HashAggregate or any other de-duplication process was a table-like
> structure.

And I assumed wrong, I think. I dug into the code (nodeHash.c and others), and I think I now understand why HashAggregate works only in certain circumstances, and I understand your comments a bit better. Basically, HashAggregate doesn't stream unique Nodes the way nodeUnique.c does. nodeUnique emits Nodes and elides subsequent, identical Nodes, which is why it relies on its input being sorted. HashAggregate, by contrast, operates on the entire input set at once, and nodeHash.c doesn't really emit Nodes at all. This makes me wonder whether nodeHash.c and nodeHashjoin.c couldn't be modified to output Nodes in a streaming fashion. The memory requirements would be no worse than they are now.
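To restate the distinction I'm describing, here is a rough Python sketch of the two strategies (illustrative only; the actual executor code in nodeUnique.c and nodeAgg.c is C and considerably more involved):

```python
# Sketch of the two de-duplication strategies discussed above.
# Illustrative Python, not PostgreSQL's actual C implementation.

def unique_streaming(sorted_rows):
    """nodeUnique-style: emit each row, elide identical successors.
    Requires sorted input; uses O(1) memory and streams results."""
    prev = object()  # sentinel that compares unequal to any row
    for row in sorted_rows:
        if row != prev:
            yield row
            prev = row

def hash_aggregate(rows):
    """HashAggregate-style: absorb the entire input into an in-memory
    hash table before emitting anything. Memory grows with the number
    of distinct values, and (as noted above) it cannot spill to disk."""
    table = {}
    for row in rows:
        table[row] = None   # key presence is all that matters
    return list(table)      # results available only after all input is consumed
```

The first function can begin producing output after a single input row; the second cannot produce anything until the whole input has been read, which is why its peak memory use is unavoidable.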
Does postgresql support any sort of merge sort? If it did, then when the hashtable started consuming too much memory, it could be cleared, the nodes output from each successive hashtable could be directed to another temporary file, and a merge sort could then be performed across all of the temporary files (and thus Unique could be used to effect the UNION operation).

--
Jon

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
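The spill-and-merge idea proposed in the message above can be sketched as follows (a hypothetical Python illustration, not existing PostgreSQL behavior; the size threshold and in-memory "runs" standing in for temporary files are made up for the example):

```python
# Sketch of the proposal: when the in-memory hash table grows past a
# threshold, dump its contents as a sorted run, start a fresh table,
# then merge-sort all runs and apply a streaming Unique pass on top.
import heapq

def dedup_with_spilling(rows, max_table_size=4):
    runs = []     # each run: a sorted list standing in for a temp file
    table = set()
    for row in rows:
        table.add(row)
        if len(table) >= max_table_size:
            runs.append(sorted(table))   # "spill" the current hash table
            table = set()
    if table:
        runs.append(sorted(table))       # final partial run
    # Merge-sort all the spilled runs, then stream Unique over the result
    # (duplicates may still exist *across* runs, so this pass is needed).
    prev = object()
    for row in heapq.merge(*runs):
        if row != prev:
            yield row
            prev = row
```

Peak memory is bounded by the spill threshold regardless of input size, at the cost of writing and re-reading the temporary runs.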