Re: queries with lots of UNIONed relations

Jon Nelson <jnelson+pgsql@xxxxxxxxxxx> · Thu, 13 Jan 2011 17:14:04 -0600

On Thu, Jan 13, 2011 at 5:05 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Robert Haas <robertmhaas@xxxxxxxxx> writes:
>> On Thu, Jan 13, 2011 at 5:26 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
>>> I don't believe there is any case where hashing each individual relation
>>> is a win compared to hashing them all together. ÂIf the optimizer were
>>> smart enough to be considering the situation as a whole, it would always
>>> do the latter.
>
>> You might be right, but I'm not sure. ÂSuppose that there are 100
>> inheritance children, and each has 10,000 distinct values, but none of
>> them are common between the tables. ÂIn that situation, de-duplicating
>> each individual table requires a hash table that can hold 10,000
>> entries. ÂBut deduplicating everything at once requires a hash table
>> that can hold 1,000,000 entries.
>
>> Or am I all wet?
>
> If you have enough memory to de-dup them individually, you surely have
> enough to de-dup all at once. ÂIt is not possible for a single hashtable
> to have worse memory consumption than N hashtables followed by a union
> hashtable, and in fact if there are no common values then the latter eats
> twice as much space because every value appears twice in two different
> hashtables.

If everything were available up-front, sure.
However, and please correct me if I'm wrong, but doesn't postgresql
work in a fairly linear fashion, moving from table to table performing
a series of operations on each? That seems to indicate that is what
the plan is:

Compare:

for each table LOOP
  scan table for result rows, append to results
END LOOP
hash / sort + unique results

versus:

for each table LOOP
  scan table for result rows, append to table-results
  hash / sort+unique table-results, append to results
END LOOP
hash / sort + unique results

In the former case, all of the result rows from all tables are
appended together before the de-duplification process can start.

In the latter case, only enough memory for each table's result set is
necessary for de-duplification, and it would only be necessary to
allocate it for that table.

Is that not how this works?

-- 
Jon

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance