Re: Out of Memory errors are frustrating as heck!

On Sun, Apr 21, 2019 at 11:40:22AM -0500, Justin Pryzby wrote:
> On Sun, Apr 21, 2019 at 10:36:43AM -0400, Tom Lane wrote:
> > Jeff Janes <jeff.janes@xxxxxxxxx> writes:
> > > The growEnabled stuff only prevents infinite loops.  It doesn't prevent
> > > extreme silliness.
> > >
> > > If a single 32 bit hash value has enough tuples by itself to not fit in
> > > work_mem, then it will keep splitting until that value is in a batch by
> > > itself before shutting off
> >
> > I suspect, however, that we might be better off just taking the existence
> > of the I/O buffers into account somehow while deciding whether it's worth
> > growing further.  That is, I'm imagining adding a second independent
> > reason for shutting off growEnabled, along the lines of "increasing
> > nbatch any further will require an unreasonable amount of buffer memory".
> > The question then becomes how to define "unreasonable".

> On Sun, Apr 21, 2019 at 06:15:25PM +0200, Tomas Vondra wrote:
> > I think the question the code needs to be asking is "If we double the
> > number of batches, does the amount of memory we need drop?" And the
> > memory needs to account both for the buffers and per-batch data.
> >
> > I don't think we can just stop increasing the number of batches when the
> > memory for BufFile exceeds work_mem, because that entirely ignores the
> > fact that by doing that we force the system to keep the per-batch stuff
> > in memory (and that can be an almost arbitrary amount).
> > ...
> > Of course, this just stops enforcing work_mem at some point, but it at
> > least attempts to minimize the amount of memory used.
>
> This patch defines reasonable as "additional BatchFiles will not themselves
> exceed work_mem; OR, exceeded work_mem already but additional BatchFiles are
> going to save us RAM"...


OK.

> I think the first condition is insensitive and not too important to get right,
> it only allows work_mem to be exceeded by 2x, which maybe already happens for
> multiple reasons, related to this thread and otherwise.  It'd be fine to slap
> on a factor of /2 or /4 or /8 there too.


TBH I'm not quite sure I understand all the conditions in the patch - it
seems unnecessarily complicated. And I don't think it actually minimizes
the amount of memory used for hash table + buffers, because it keeps the
same spaceAllowed (which pushes nbatches up). At some point it actually
makes sense to bump spaceAllowed and make larger batches instead of adding
more batches, and the patch does not seem to do that.

Also, the patch does this:

   if (hashtable->nbatch * sizeof(PGAlignedBlock) < hashtable->spaceAllowed)
   {
       ExecHashIncreaseNumBatches(hashtable);
   }
   else if (hashtable->spaceUsed / 2 >= hashtable->spaceAllowed)
   {
       /* Exceeded spaceAllowed by 2x, so we'll save RAM by allowing nbatches to increase */
       /* I think this branch would be hit almost the same as the branch below */
       ExecHashIncreaseNumBatches(hashtable);
   }
   ...

but the reasoning for the second branch seems wrong, because

   (spaceUsed/2 >= spaceAllowed)

is not enough to guarantee that we actually save memory by doubling the
number of batches. To do that, we need to make sure that

  (spaceUsed/2 >= hashtable->nbatch * sizeof(PGAlignedBlock))

But that may not be true - it certainly is not guaranteed by not getting
into the first branch.
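
To make the trade-off concrete, here is a minimal sketch (hypothetical,
not code from either patch) of the guard this reasoning implies -
doubling nbatch allocates nbatch new BufFile buffers, so it only pays
off when the expected savings on tuple data exceed that cost:

   /*
    * Doubling nbatch moves roughly half of spaceUsed out to disk, but
    * allocates nbatch new BufFile buffers (one PGAlignedBlock, i.e. 8kB,
    * each). Only grow when the expected savings exceed that cost.
    */
   if (hashtable->spaceUsed / 2 >= hashtable->nbatch * sizeof(PGAlignedBlock))
       ExecHashIncreaseNumBatches(hashtable);

E.g. at nbatch=1024 the new buffers alone cost 8MB, so doubling only
helps if at least 8MB of tuple data (spaceUsed >= 16MB) would move out
of memory.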

Consider an ideal example with uniform distribution:

   create table small (id bigint, val text);
   create table large (id bigint, val text);

   insert into large select 1000000000 * random(), md5(i::text)
   from generate_series(1, 700000000) s(i);

   insert into small select 1000000000 * random(), md5(i::text)
   from generate_series(1, 10000) s(i);

   vacuum analyze large;
   vacuum analyze small;

   update pg_class set (relpages, reltuples) = (1000000, 1)
   where relname = 'large';

   update pg_class set (relpages, reltuples) = (1, 1000000)
   where relname = 'small';

   set work_mem = '1MB';

   explain analyze select * from small join large using (id);

A log after each call to ExecHashIncreaseNumBatches says this:

   nbatch=2 spaceUsed=463200 spaceAllowed=1048576 BufFile=16384
   nbatch=4 spaceUsed=463120 spaceAllowed=1048576 BufFile=32768
   nbatch=8 spaceUsed=457120 spaceAllowed=1048576 BufFile=65536
   nbatch=16 spaceUsed=458320 spaceAllowed=1048576 BufFile=131072
   nbatch=32 spaceUsed=457120 spaceAllowed=1048576 BufFile=262144
   nbatch=64 spaceUsed=459200 spaceAllowed=1048576 BufFile=524288
   nbatch=128 spaceUsed=455600 spaceAllowed=1048576 BufFile=1048576
   nbatch=256 spaceUsed=525120 spaceAllowed=1048576 BufFile=2097152
   nbatch=256 spaceUsed=2097200 spaceAllowed=1048576 BufFile=2097152
   nbatch=512 spaceUsed=2097200 spaceAllowed=1048576 BufFile=4194304
   nbatch=1024 spaceUsed=2097200 spaceAllowed=1048576 BufFile=8388608
   nbatch=2048 spaceUsed=2097200 spaceAllowed=1048576 BufFile=16777216
   nbatch=4096 spaceUsed=2097200 spaceAllowed=1048576 BufFile=33554432
   nbatch=8192 spaceUsed=2097200 spaceAllowed=1048576 BufFile=67108864
   nbatch=16384 spaceUsed=2097200 spaceAllowed=1048576 BufFile=134217728

So we've succeeded in keeping spaceUsed below 2*spaceAllowed (which
seems rather confusing, BTW), but we've allocated 128MB (16384 x 8kB)
for BufFile. So about 130MB in total, with 16k batches.

What I think might work better is the attached v2 of the patch, with a
single top-level condition, comparing the combined memory usage
(spaceUsed + BufFile) against spaceAllowed. But it also tweaks
spaceAllowed once the size needed for BufFile gets over work_mem/3.
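
To spell out the arithmetic: with work_mem = 1MB, the bump first triggers
at nbatch=64, where the BufFiles need 64 x 8kB = 512kB > work_mem/3
(~341kB), so spaceAllowed becomes 3 x 524288 = 1572864 bytes - exactly
the value visible in the log below.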

And it behaves like this:

   nbatch=2 spaceUsed=458640 spaceAllowed=1048576 BufFile=16384
   nbatch=4 spaceUsed=455040 spaceAllowed=1048576 BufFile=32768
   nbatch=8 spaceUsed=440160 spaceAllowed=1048576 BufFile=65536
   nbatch=16 spaceUsed=426560 spaceAllowed=1048576 BufFile=131072
   nbatch=32 spaceUsed=393200 spaceAllowed=1048576 BufFile=262144
   nbatch=64 spaceUsed=329120 spaceAllowed=1572864 BufFile=524288
   nbatch=128 spaceUsed=455600 spaceAllowed=3145728 BufFile=1048576
   nbatch=256 spaceUsed=987440 spaceAllowed=6291456 BufFile=2097152
   nbatch=512 spaceUsed=2040560 spaceAllowed=12582912 BufFile=4194304
   nbatch=1024 spaceUsed=4114640 spaceAllowed=25165824 BufFile=8388608
   nbatch=2048 spaceUsed=8302880 spaceAllowed=50331648 BufFile=16777216

So we end up with just 2k batches, using ~24MB of memory in total (~8MB
of spaceUsed + 16MB of BufFile buffers). That's because the spaceAllowed
limit was bumped up instead of adding more and more batches.

> The current patch doesn't unset growEnabled, since there's no point at which
> the hashtable should grow without bound: if hash tables are *already* exceeding
> work_mem by 2x, nbatches should be doubled.


Not sure. I guess it might be useful to re-evaluate the flag after a
while - not necessarily by actually enabling it right away, but just
checking if it'd move any tuples. Just disabling it once may be an issue
when the input data is somehow correlated, which seems to be one of the
issues with the data set discussed in this thread.
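
Something like this sketch is what I have in mind (purely hypothetical -
the counter field and the threshold below are made-up names, they don't
exist in nodeHash.c):

   /*
    * Instead of disabling growth permanently after a failed split, count
    * tuples inserted since then and allow one more attempt once enough
    * new data has arrived that a split might actually move something.
    */
   if (!hashtable->growEnabled &&
       ++hashtable->tuplesSinceGrowCheck >= GROW_RECHECK_THRESHOLD)
   {
       hashtable->tuplesSinceGrowCheck = 0;
       hashtable->growEnabled = true;	/* may get disabled again */
   }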

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 64eec91f8b..667a23eed0 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1653,10 +1653,54 @@ ExecHashTableInsert(HashJoinTable hashtable,
 		hashtable->spaceUsed += hashTupleSize;
 		if (hashtable->spaceUsed > hashtable->spacePeak)
 			hashtable->spacePeak = hashtable->spaceUsed;
+
+		/*
+		 * Consider increasing number of batches.
+		 *
+		 * Each batch requires a non-trivial amount of memory, because BufFile
+		 * includes a PGAlignedBlock (typically 8kB buffer). So when doubling
+		 * the number of batches, we need to be careful and only allow that if
+		 * it actually has a chance of reducing memory usage.
+		 *
+		 * In particular, doubling the number of batches is pointless when
+		 *
+		 *		(spaceUsed / 2) < (nbatches * sizeof(BufFile))
+		 *
+		 * because we expect to save roughly 1/2 of memory currently used for
+		 * data (rows) at the price of doubling the memory used for BufFile.
+		 *
+		 * We can't stop adding batches entirely, because that would just mean
+		 * the batches would need more and more memory. So we need to increase
+		 * the number of batches, even if we can't enforce work_mem properly.
+		 * The goal is to minimize the overall memory usage of the hash join.
+		 *
+		 * Note: This applies mostly to cases of significant underestimates,
+		 * resulting in an explosion of the number of batches. The properly
+		 * estimated cases should generally end up using a merge join, based
+		 * on the high cost of the batched hash join.
+		 */
 		if (hashtable->spaceUsed +
-			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
+			hashtable->nbuckets_optimal * sizeof(HashJoinTuple) +
+			hashtable->nbatch * sizeof(PGAlignedBlock)
 			> hashtable->spaceAllowed)
+		{
 			ExecHashIncreaseNumBatches(hashtable);
+
+			/*
+			 * Consider increasing the resize threshold. For well estimated cases
+			 * this does nothing, because batches are expected to account for only
+			 * a small fraction of work_mem. But if we significantly underestimate
+			 * the number of batches, we may end up in a situation where the
+			 * BufFiles alone exceed work_mem. So move the threshold a bit, to the
+			 * next point where it'll make sense to consider adding batches again.
+			 */
+			hashtable->spaceAllowed
+				= Max(hashtable->spaceAllowed,
+					  hashtable->nbatch * sizeof(PGAlignedBlock) * 3);
+
+			elog(WARNING, "ExecHashIncreaseNumBatches: nbatch=%d spaceAllowed=%ld",
+				 hashtable->nbatch, hashtable->spaceAllowed);
+		}
 	}
 	else
 	{
@@ -2661,6 +2705,8 @@ ExecHashGetInstrumentation(HashInstrumentation *instrument,
 	instrument->nbatch = hashtable->nbatch;
 	instrument->nbatch_original = hashtable->nbatch_original;
 	instrument->space_peak = hashtable->spacePeak;
+	/* account for memory used for BufFile */
+	instrument->space_peak += hashtable->nbatch * sizeof(PGAlignedBlock);
 }
 
 /*
