Wow, we are getting somewhere. Tom (BTW, your mail server rejects my direct mail, but I'm glad
you got it through the list), you say: I'm pretty sure that's not the droid we're looking for. ExecHashJoinGetSavedTuple does palloc a new tuple, but it immediately sticks it into a TupleTableSlot that will be responsible for freeing it (when the next tuple is stuck into the same slot). I'd suggest continuing a few times and looking for other code paths leading to AllocSetAlloc in this context. I did continue a "few times", but few as in a dozen, it's always
the same (gdb) bt 6 #0 AllocSetAlloc (context=0x1168230, size=375) at aset.c:715 #1 0x000000000084e6cd in palloc (size=375) at mcxt.c:938 #2 0x000000000061019c in ExecHashJoinGetSavedTuple (file=file@entry=0x21df688, hashvalue=hashvalue@entry=0x7fff2e4ca76c, tupleSlot=0x10856b8, hjstate=0x11688e0) at nodeHashjoin.c:1277 #3 0x0000000000610c83 in ExecHashJoinNewBatch (hjstate=0x11688e0) at nodeHashjoin.c:1042 #4 ExecHashJoinImpl (parallel=false, pstate=0x11688e0) at nodeHashjoin.c:539 #5 ExecHashJoin (pstate=0x11688e0) at nodeHashjoin.c:565 (More stack frames follow...) So I decided to just let it go until it exits the ExecHashJoin function: #6 0x00000000005fde68 in ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:461 461 result = node->ExecProcNodeReal(node); (gdb) list 456 { 457 TupleTableSlot *result; 458 459 InstrStartNode(node->instrument); 460 461 result = node->ExecProcNodeReal(node); 462 463 InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0); 464 465 return result; (gdb) break 463 Breakpoint 3 at 0x5fde68: file execProcnode.c, line 463. (gdb) disable 2 (gdb) cont Continuing. Breakpoint 3, ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:463 463 InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0); oops, that was fast, so up further ... (gdb) cont Continuing. Breakpoint 4, ExecSort (pstate=0x11687d0) at nodeSort.c:109 109 if (TupIsNull(slot)) (gdb) cont Continuing. Breakpoint 3, ExecProcNodeInstr (node=0x11688e0) at execProcnode.c:463 463 InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0); (gdb) cont Continuing. Breakpoint 4, ExecSort (pstate=0x11687d0) at nodeSort.c:109 109 if (TupIsNull(slot)) (gdb) up #1 0x00000000005fde68 in ExecProcNodeInstr (node=0x11687d0) at execProcnode.c:461 461 result = node->ExecProcNodeReal(node); (gdb) up #2 0x000000000061d2c4 in ExecProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247 247 return node->ExecProcNode(node); (gdb) up #3 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73 73 slot = ExecProcNode(outerPlan); (gdb) list 68 for (;;) 69 { 70 /* 71 * fetch a tuple from the outer subplan 72 */ 73 slot = ExecProcNode(outerPlan); 74 if (TupIsNull(slot)) 75 { 76 /* end of subplan, so we're done */ 77 ExecClearTuple(resultTupleSlot); ... but whatever I do, ultimately I get to that allocation routine through the same path. Since that is the bulk of the activity, and memory was still growing while we come through this path, I assume that this is it. My first thought on noticing the SELECT DISTINCT was that you might be hitting the grouping-function-related leak that Andres fixed in 9cf37a527; but that fix did make it into 11.2 (by just a couple of days...). Still, maybe there's another issue in the same area. I don't know about that one, I only know that I am running 11.2 freshly compiled. The change suggested by Alvaro Herrera wasn't applicable. Jeff Janes had more
Not sure how you could tell that? It's the same place as everything else. If we can find out what you're looking for, may be we can set a break point earlier up the call chain? But why? If I see the memory still go up insanely fast, isn't that a sign for the leak? How about it's in the DISTINCT? I noticed while peeking up the call chain, that it was already in the UNIQUE sort thing also. I guess it's streaming the results from the hash join right into the unique sort step. I have just done that and it creates an insane amount of output from all the processes, I'm afraid there will be no way to keep that stuff separated. If there was a way of turning that one and off for one process only, then we could probably get more info... Everything is also extremely slow that way. Like in a half hour
the memory didn't even reach 100 MB.
I guess I should look for both, address and size to match it better. Yes, massive, like I said. Impossible to use. File system fills up rapidly. I made it so that it can be turned on and off, with the debugger. int _alloc_info = 0; #ifdef HAVE_ALLOCINFO #define AllocFreeInfo(_cxt, _chunk) \ if(_alloc_info) \ fprintf(stderr, "AllocFree: %s: %p, %zu\n", \ (_cxt)->header.name, (_chunk), (_chunk)->size) #define AllocAllocInfo(_cxt, _chunk) \ if(_alloc_info) \ fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \ (_cxt)->header.name, (_chunk), (_chunk)->size) #else #define AllocFreeInfo(_cxt, _chunk) #define AllocAllocInfo(_cxt, _chunk) #endif so with this I do (gdb) b AllocSetAlloc (gdb) cont (gdb) set _alloc_info=1 (gdb) disable (gdb) cont then I wait, ... until it crashes again ... no, it's too much. It
fills up my filesystem in no time with the logs. It produced 3 GB
in just a minute of run time. And also, I doubt we can find anything specifically by allocation
size. It's just going to be 512 or whatever. Isn't there some other way? I'm going to try without that DISTINCT step, or perhaps by dismantling this query until it works without this excessive memory growth. -Gunther
|