On Sun, Apr 14, 2019 at 11:24:59PM -0400, Tom Lane wrote:
> Gunther <raj@xxxxxxxx> writes:
> > ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
>
> Oooh, that looks like a memory leak right enough.  The ExecutorState
> should not get that big for any reasonable query.

On Tue, Apr 16, 2019 at 11:30:19AM -0400, Tom Lane wrote:
> Hmm ... this matches up with a vague thought I had that for some reason
> the hash join might be spawning a huge number of separate batches.
> Each batch would have a couple of files with associated in-memory
> state including an 8K I/O buffer, so you could account for the

On Tue, Apr 16, 2019 at 10:24:53PM -0400, Gunther wrote:
> ->  Hash (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127 loops=1)
>       Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  Memory Usage: 189207kB

Is it significant that there are ~2x as many ExecutorState blocks as there
are batches?  266261/131072 => 2.03...

If there was 1 block leaked when batch=2, 2 blocks leaked when batch=4,
4 blocks leaked when batch=8, and so on, then by batch=131072 the last
doubling alone would leak 64k blocks, for ~131072 leaked blocks in total.

I'm guessing Tom probably already thought of this, but:
2230341056/266261 => ~8376, which is pretty close to the 8kB I/O buffer
you were talking about (assuming the number of same-sized buffers is much
greater than the number of other allocations).

If Tom thinks (as I understand) that the issue is *not* a memory leak, but
out-of-control growth of the number of batches, and a failure to account
for their size... then this patch might help.

The number of batches is increased to avoid exceeding work_mem.  With very
low work_mem (or a very large number of tuples being hashed), it'll try to
use a huge number of batches.  At some point the memory used by the
BatchFile structures (increasing by powers of two) itself exceeds work_mem.
With larger work_mem, there's less need for more batches, so it's the
number of batches used under small work_mem that needs to be constrained
somehow, whether relative to work_mem or otherwise.

With my patch, the number of batches is nonlinear WRT work_mem, and reaches
a maximum for moderately small work_mem.  The goal is to choose the optimal
number of batches to minimize the degree to which work_mem is exceeded.

Justin
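For what it's worth, here's a throwaway C program restating the arithmetic
above.  It's not part of the patch; the "one block leaked per newly created
batch at each doubling" rule is just the guess above written as code, and
the byte/block counts are copied from Gunther's MemoryContextStats output.

/*
 * Sanity check of the block arithmetic above (standalone, hypothetical).
 */
#include <stdio.h>

int
main(void)
{
	long long	used_bytes = 2230341056LL;	/* ExecutorState "used" */
	long long	blocks = 266261;			/* ExecutorState blocks */
	long long	batches = 131072;			/* Batches from the EXPLAIN */
	long long	leaked = 0;

	/* doublings 2 -> 4 -> ... -> 131072, leaking nbatch/2 blocks each time */
	for (long long nbatch = 2; nbatch <= batches; nbatch *= 2)
		leaked += nbatch / 2;

	printf("blocks per batch: %.2f\n", (double) blocks / batches);		/* ~2.03 */
	printf("leaked blocks:    %lld\n", leaked);				/* 131071 */
	printf("avg block size:   %.0f bytes\n", (double) used_bytes / blocks); /* ~8376 */
	return 0;
}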
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 64eec91..4e99093 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -1036,8 +1036,11 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 * Increasing nbatch will not fix it since there's no way to subdivide the
 	 * group any more finely. We have to just gut it out and hope the server
 	 * has enough RAM.
+	 * Also avoid increasing nbatch if an additional nbatch would cause
+	 * overhead of batchFiles alone to exceed work_mem.
 	 */
 	if (nfreed == 0 || nfreed == ninmemory)
+		// && (1<<hashtable->nbatch > hashtable->spaceAllowed/8192))
 	{
 		hashtable->growEnabled = false;
 #ifdef HJDEBUG
@@ -1655,8 +1658,18 @@ ExecHashTableInsert(HashJoinTable hashtable,
 			hashtable->spacePeak = hashtable->spaceUsed;
 		if (hashtable->spaceUsed +
 			hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
-			> hashtable->spaceAllowed)
-			ExecHashIncreaseNumBatches(hashtable);
+			> hashtable->spaceAllowed && hashtable->growEnabled) {
+			/*
+			 * spaceUsed doesn't include the overhead of
+			 * BatchFile structure.  If the overhead of an
+			 * additional 2x batch files would use more than the
+			 * space itself, do not grow...
+			 */
+			if (1<<hashtable->nbatch < hashtable->spaceUsed/8192)
+				ExecHashIncreaseNumBatches(hashtable);
+			else
+				hashtable->growEnabled = false;
+		}
 	}
 	else
 	{
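To illustrate the "nonlinear WRT work_mem" point, below is a standalone
simulation (my own sketch, not executor code) of the growth test the patch
adds, 1<<hashtable->nbatch < hashtable->spaceUsed/8192.  It treats a few
hypothetical work_mem settings as a stand-in for spaceUsed and assumes
8192 bytes of overhead per batch file.

#include <stdio.h>

/* Largest nbatch the patch's test would let the hash join grow to. */
static int
capped_nbatch(long long space_used)
{
	int			nbatch = 2;		/* the plan above also started at 2 batches */

	/* double only while the patch's test allows growth (1LL to avoid overflow) */
	while (nbatch < 62 && (1LL << nbatch) < space_used / 8192)
		nbatch *= 2;
	return nbatch;
}

int
main(void)
{
	/* a few hypothetical work_mem settings, in bytes */
	for (long long work_mem = 1024LL * 1024; work_mem <= 1024LL * 1024 * 1024; work_mem *= 8)
		printf("work_mem=%7lldkB -> nbatch stops growing at %d\n",
			   work_mem / 1024, capped_nbatch(work_mem));
	return 0;
}

With the test written that way the cap flattens out almost immediately
(nbatch stops at 8 for 1MB and at 16 from 8MB upward in this toy run),
which is the "reaches a maximum for moderately small work_mem" behaviour
described above.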