On Sun, Apr 14, 2019 at 11:24:59PM -0400, Tom Lane wrote:
> Gunther <raj@xxxxxxxx> writes:
> > ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
>
> Oooh, that looks like a memory leak right enough.  The ExecutorState
> should not get that big for any reasonable query.

On Tue, Apr 16, 2019 at 11:30:19AM -0400, Tom Lane wrote:
> Hmm ... this matches up with a vague thought I had that for some reason
> the hash join might be spawning a huge number of separate batches.
> Each batch would have a couple of files with associated in-memory
> state including an 8K I/O buffer, so you could account for the

On Tue, Apr 16, 2019 at 10:24:53PM -0400, Gunther wrote:
> ->  Hash (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127 loops=1)
>       Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  Memory Usage: 189207kB

Is it significant that there are ~2x as many ExecutorState blocks as there
are batches?  266261/131072 => 2.03...

If there was 1 block leaked when batch=2, 2 blocks leaked when batch=4,
4 blocks leaked when batch=8, and so on, then by batch=131072 the last
doubling alone would leak 64k blocks, for ~131072 leaked blocks in total.

I'm guessing Tom probably already thought of this, but:
2230341056/266261 => ~8376, which is pretty close to the 8kB I/O buffer
you were talking about (assuming the number of same-sized buffers is much
greater than the number of other allocations).

If Tom thinks (as I understand) that the issue is *not* a memory leak, but
out-of-control growth of the number of batches, and a failure to account
for their size... then this patch might help.

The number of batches is increased to avoid exceeding work_mem.  With very
low work_mem (or a very large number of tuples being hashed), it'll try to
use a huge number of batches.  At some point the memory used by the
BatchFile structures (increasing by powers of two) itself exceeds work_mem.
With larger work_mem, there's less need for more batches, so it's the
number of batches used under small work_mem that needs to be constrained
somehow, whether relative to work_mem or otherwise.

With my patch, the number of batches is nonlinear WRT work_mem, and reaches
a maximum for moderately small work_mem.  The goal is to choose the optimal
number of batches to minimize the degree to which work_mem is exceeded.

Justin
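For what it's worth, here's a throwaway C program restating the arithmetic
above.  It's not part of the patch; the "one block leaked per newly created
batch at each doubling" rule is just the guess above written as code, and
the byte/block counts are copied from Gunther's MemoryContextStats output.

/*
 * Sanity check of the block arithmetic above (standalone, hypothetical).
 */
#include <stdio.h>

int
main(void)
{
	long long	used_bytes = 2230341056LL;	/* ExecutorState "used" */
	long long	blocks = 266261;			/* ExecutorState blocks */
	long long	batches = 131072;			/* Batches from the EXPLAIN */
	long long	leaked = 0;

	/* doublings 2 -> 4 -> ... -> 131072, leaking nbatch/2 blocks each time */
	for (long long nbatch = 2; nbatch <= batches; nbatch *= 2)
		leaked += nbatch / 2;

	printf("blocks per batch: %.2f\n", (double) blocks / batches);		/* ~2.03 */
	printf("leaked blocks:    %lld\n", leaked);				/* 131071 */
	printf("avg block size:   %.0f bytes\n", (double) used_bytes / blocks); /* ~8376 */
	return 0;
}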
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 64eec91..4e99093 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1036,8 +1036,11 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * Increasing nbatch will not fix it since there's no way to subdivide the
 	 * group any more finely. We have to just gut it out and hope the server
 	 * has enough RAM.
+	 * Also avoid increasing nbatch if an additional nbatch would cause
+	 * overhead of batchFiles alone to exceed work_mem.
 	 */
 	if (nfreed == 0 || nfreed == ninmemory)
+		// && (1<<hashtable->nbatch > hashtable->spaceAllowed/8192))
 	{
 		hashtable->growEnabled = false;
 #ifdef HJDEBUG
@@ -1655,8 +1658,18 @@ ExecHashTableInsert(HashJoinTable hashtable,
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed +
 			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
-			> hashtable->spaceAllowed)
-			ExecHashIncreaseNumBatches(hashtable);
+			> hashtable->spaceAllowed && hashtable->growEnabled) {
+			/*
+			 * spaceUsed doesn't include the overhead of
+			 * BatchFile structure.  If the overhead of an
+			 * additional 2x batch files would use more than the
+			 * space itself, do not grow...
+			 */
+			if (1<<hashtable->nbatch < hashtable->spaceUsed/8192)
+				ExecHashIncreaseNumBatches(hashtable);
+			else
+				hashtable->growEnabled = false;
+		}
 	}
 	else
 	{
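To illustrate the "nonlinear WRT work_mem" point, below is a standalone
simulation (my own sketch, not executor code) of the growth test the patch
adds, 1<<hashtable->nbatch < hashtable->spaceUsed/8192.  It treats a few
hypothetical work_mem settings as a stand-in for spaceUsed and assumes
8192 bytes of overhead per batch file.

#include <stdio.h>

/* Largest nbatch the patch's test would let the hash join grow to. */
static int
capped_nbatch(long long space_used)
{
	int			nbatch = 2;		/* the plan above also started at 2 batches */

	/* double only while the patch's test allows growth (1LL to avoid overflow) */
	while (nbatch < 62 && (1LL << nbatch) < space_used / 8192)
		nbatch *= 2;
	return nbatch;
}

int
main(void)
{
	/* a few hypothetical work_mem settings, in bytes */
	for (long long work_mem = 1024LL * 1024; work_mem <= 1024LL * 1024 * 1024; work_mem *= 8)
		printf("work_mem=%7lldkB -> nbatch stops growing at %d\n",
			   work_mem / 1024, capped_nbatch(work_mem));
	return 0;
}

With the test written that way the cap flattens out almost immediately
(nbatch stops at 8 for 1MB and at 16 from 8MB upward in this toy run),
which is the "reaches a maximum for moderately small work_mem" behaviour
described above.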