Batch splitting shouldn't be followed by a hash function change?
On Mon, Apr 22, 2019, 05:09 Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxx> wrote:
On Sun, Apr 21, 2019 at 11:40:22AM -0500, Justin Pryzby wrote:
>On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
>> Jeff Janes <jeff.janes@xxxxxxxxx> writes:
>> > The growEnabled stuff only prevents infinite loops. It doesn't prevent
>> > extreme silliness.
>>
>> > If a single 32 bit hash value has enough tuples by itself to not fit in
>> > work_mem, then it will keep splitting until that value is in a batch by
>> > itself before shutting off
>>
>> I suspect, however, that we might be better off just taking the existence
>> of the I/O buffers into account somehow while deciding whether it's worth
>> growing further. That is, I'm imagining adding a second independent
>> reason for shutting off growEnabled, along the lines of "increasing
>> nbatch any further will require an unreasonable amount of buffer memory".
>> The question then becomes how to define "unreasonable".
>
>On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
>> I think the question the code needs to be asking is "If we double the
>> number of batches, does the amount of memory we need drop?" And the
>> memory needs to account both for the buffers and per-batch data.
>>
>> I don't think we can just stop increasing the number of batches when the
>> memory for BufFile exceeds work_mem, because that entirely ignores the
>> fact that by doing that we force the system to keep the per-batch stuff
>> in memory (and that can be almost arbitrary amount).
>...
>> Of course, this just stops enforcing work_mem at some point, but it at
>> least attempts to minimize the amount of memory used.
>
>This patch defines reasonable as "additional BatchFiles will not themselves
>exceed work_mem; OR, exceeded work_mem already but additional BatchFiles are
>going to save us RAM"...
>
OK.
>I think the first condition is insensitive and not too important to get right,
>it only allows work_mem to be exceeded by 2x, which maybe already happens for
>multiple reasons, related to this thread and otherwise. It'd be fine to slap
>on a factor of /2 or /4 or /8 there too.
>
TBH I'm not quite sure I understand all the conditions in the patch - it
seems unnecessarily complicated. And I don't think it actually minimizes
the amount of memory used for hash table + buffers, because it keeps the
same spaceAllowed (which pushes nbatches up). At some point it actually
makes to bump spaceAllowed and make larger batches instead of adding
more batches, and the patch does not seem to do that.
Also, the patch does this:
if (hashtable->nbatch*sizeof(PGAlignedBlock) < hashtable->spaceAllowed)
{
ExecHashIncreaseNumBatches(hashtable);
}
else if (hashtable->spaceUsed/2 >= hashtable->spaceAllowed)
{
/* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing nbatches to increase */
/* I think this branch would be hit almost same as below branch */
ExecHashIncreaseNumBatches(hashtable);
}
...
but the reasoning for the second branch seems wrong, because
(spaceUsed/2 >= spaceAllowed)
is not enough to guarantee that we actually save memory by doubling the
number of batches. To do that, we need to make sure that
(spaceUsed/2 >= hashtable->nbatch * sizeof(PGAlignedBlock))
But that may not be true - it certainly is not guaranteed by not getting
into the first branch.
Consider an ideal example with uniform distribution:
create table small (id bigint, val text);
create table large (id bigint, val text);
insert into large select 1000000000 * random(), md5(i::text)
from generate_series(1, 700000000) s(i);
insert into small select 1000000000 * random(), md5(i::text)
from generate_series(1, 10000) s(i);
vacuum analyze large;
vacuum analyze small;
update pg_class set (relpages, reltuples) = (1000000, 1)
where relname = 'large';
update pg_class set (relpages, reltuples) = (1, 1000000)
where relname = 'small';
set work_mem = '1MB';
explain analyze select * from small join large using (id);
A log after each call to ExecHashIncreaseNumBatches says this:
nbatch=2 spaceUsed=463200 spaceAllowed=1048576 BufFile=16384
nbatch=4 spaceUsed=463120 spaceAllowed=1048576 BufFile=32768
nbatch=8 spaceUsed=457120 spaceAllowed=1048576 BufFile=65536
nbatch=16 spaceUsed=458320 spaceAllowed=1048576 BufFile=131072
nbatch=32 spaceUsed=457120 spaceAllowed=1048576 BufFile=262144
nbatch=64 spaceUsed=459200 spaceAllowed=1048576 BufFile=524288
nbatch=128 spaceUsed=455600 spaceAllowed=1048576 BufFile=1048576
nbatch=256 spaceUsed=525120 spaceAllowed=1048576 BufFile=2097152
nbatch=256 spaceUsed=2097200 spaceAllowed=1048576 BufFile=2097152
nbatch=512 spaceUsed=2097200 spaceAllowed=1048576 BufFile=4194304
nbatch=1024 spaceUsed=2097200 spaceAllowed=1048576 BufFile=8388608
nbatch=2048 spaceUsed=2097200 spaceAllowed=1048576 BufFile=16777216
nbatch=4096 spaceUsed=2097200 spaceAllowed=1048576 BufFile=33554432
nbatch=8192 spaceUsed=2097200 spaceAllowed=1048576 BufFile=67108864
nbatch=16384 spaceUsed=2097200 spaceAllowed=1048576 BufFile=134217728
So we've succeeded in keeping spaceUsed below 2*spaceAllowed (which
seems rather confusing, BTW), but we've allocated 128MB for BufFile. So
about 130MB in total. With 16k batches.
What I think might work better is the attached v2 of the patch, with a
single top-level condition, comparing the combined memory usage
(spaceUsed + BufFile) against spaceAllowed. But it also tweaks
spaceAllowed once the size needed for BufFile gets over work_mem/3.
And it behaves like this:
nbatch=2 spaceUsed=458640 spaceAllowed=1048576 BufFile=16384
nbatch=4 spaceUsed=455040 spaceAllowed=1048576 BufFile=32768
nbatch=8 spaceUsed=440160 spaceAllowed=1048576 BufFile=65536
nbatch=16 spaceUsed=426560 spaceAllowed=1048576 BufFile=131072
nbatch=32 spaceUsed=393200 spaceAllowed=1048576 BufFile=262144
nbatch=64 spaceUsed=329120 spaceAllowed=1572864 BufFile=524288
nbatch=128 spaceUsed=455600 spaceAllowed=3145728 BufFile=1048576
nbatch=256 spaceUsed=987440 spaceAllowed=6291456 BufFile=2097152
nbatch=512 spaceUsed=2040560 spaceAllowed=12582912 BufFile=4194304
nbatch=1024 spaceUsed=4114640 spaceAllowed=25165824 BufFile=8388608
nbatch=2048 spaceUsed=8302880 spaceAllowed=50331648 BufFile=16777216
So we end up with just 2k batches, using ~24MB of memory in total.
That's because the spaceAllowed limit was bumped up instead of adding
more and more batches.
>The current patch doesn't unset growEnabled, since there's no point at which
>the hashtable should grow without bound: if hash tables are *already* exceeding
>work_mem by 2x as big, nbatches should be doubled.
>
Not sure. I guess it might be useful to re-evaluate the flag after a
while - not necessarily by actually enabling it right away, but just
checking if it'd move any tuples. Just disabling it once may be an issue
when the input data is somehow correlated, which seems to be one of the
issues with the data set discussed in this thread.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services