I saw your replies. If there was a way of using gdb commands to have a conditional breakpoint which only fires when the n-th caller in the chain is not a certain source location, then one could exclude the bulk of these allocations and focus better.
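For the record, gdb builds with Python support have convenience functions that can express pretty much that kind of caller test, although they match on function names rather than exact source locations. A minimal sketch, with the function name and the frame depth purely as placeholders:

(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && !$_caller_is("ExecHashJoinGetSavedTuple", 3)
(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && !$_any_caller_is("ExecHashJoinGetSavedTuple", 10)

$_caller_is tests only the frame exactly that many levels up, while $_any_caller_is tests every frame up to that depth.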
But I decided to try re-factoring this query instead, and I made an interesting observation.

There is a left outer join in parenthesis ... LEFT OUTER JOIN (SELECT ....) q ... the biggest parenthesis. I turned this into a temporary table, tmp_bulk, then changed the main query to ... LEFT OUTER JOIN tmp_bulk q ... and now I am running it again. But what I noticed is that the tmp_bulk table is tiny! It only has about 250 rows, so the vast majority of the rows on the left side of that join are unmatched.

The plan is all different now. Heavy CPU% load. Must be merge sorting? No memory growth, not yet.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5394 postgres  20   0 1284448 287880 271528 R  99.3  3.6   9:21.83 postgres: postgres integrator [local] EXPLAIN
 5425 postgres  20   0 1278556  93184  82296 S  27.6  1.2   0:38.72 postgres: parallel worker for PID 5394

No, I never trust it when a database job has high CPU% and low IO for a long time. So I do

SET ENABLE_MERGEJOIN TO OFF;

and run it again. Now I have high IO and low CPU%:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5394 postgres  20   0 1280904 282036 273616 D   2.3  3.6  13:01.51 postgres: postgres integrator [local] EXPLAIN
 5510 postgres  20   0 1278892  87524  80804 D   2.3  1.1   0:05.20 postgres: parallel worker for PID 5394
 5509 postgres  20   0 1278860  87484  80796 D   2.3  1.1   0:05.30 postgres: parallel worker for PID 5394

Still, I slip back into the high CPU% situation; I guess I'll have to wait it out ...

... and still waiting. No growth whatsoever. The plan is now so totally different that it probably won't trigger the problem. The original plan that causes the leak involved right joins; this one has only left joins. Even after ANALYZE tmp_bulk it still comes up with the same plan. And that plan isn't quick to finish, but it also doesn't trigger the memory leak.
So what I can tell is this: that growth to 1.5 GB is happening consistently. It isn't just happening in the beginning, with the rest being merely a follow-up problem. Also, there seems to be a final spike in growth from 1.5 GB to 2.2 GB that happens within a second. That seems very strange.

Back to the debugger, then, to do a better job with conditional breakpoints ... I already have an idea how I'll do that: I set a flag when I enter the
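A minimal sketch of that flag idea, using a gdb convenience variable (the function name here is only a stand-in):

(gdb) set $inside = 0
(gdb) break ExecHashJoinNewBatch
(gdb) commands
> silent
> set $inside = 1
> continue
> end
(gdb) break AllocSetAlloc if $inside && (int)strcmp(context->name, "ExecutorState") == 0

The flag would of course still have to be cleared again, e.g. by a second breakpoint with its own commands block placed right after the region of interest.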
Isn't 1.5 GB already way too big? There are 3 phases really.

> Anyway, the upshot is that you need to investigate what's happening
> while the memory consumption is increasing. The behavior before that
> starts to happen isn't going to be very interesting. It might be a bit
> tricky to catch that if it only takes a few seconds to blow up, but
> you could try "c 10000" or so to step through a lot of AllocSetAlloc
> calls, repeating till the bad stuff starts to happen, and then going
> back to looking at just where the calls are coming from.
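Concretely, that amounts to repeating something like the following until top shows the resident size actually climbing (the count is arbitrary):

(gdb) c 10000
(gdb) bt 8

c 10000 continues and skips the next 9,999 hits of the breakpoint we are currently stopped at, so each round gives a fresh sample of where the allocations are coming from.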
I thought that slow massive growth is already a sign of a leak?

I will now filter the calls that come through ExecHashJoinGetSavedTuple. I figured I can do this:

(gdb) info frame
Stack level 0, frame at 0x7ffcbf92fdd0:
 rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e7dd
 called by frame at 0x7ffcbf92fdf0
 source language c.
 Arglist at 0x7ffcbf92fdc0, args: context=0x29a6450, size=371
 Locals at 0x7ffcbf92fdc0, Previous frame's sp is 0x7ffcbf92fdd0
 Saved registers:
  rip at 0x7ffcbf92fdc8

So if the saved $rip is 0x84e7dd, then we are coming this way.
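In other words, at this particular stop the caller's return address happens to sit right at the top of the stack, which is what lets the breakpoint condition below peek at it through $rsp. A quick sanity check (a sketch, not taken from the actual session):

(gdb) p/x *(long *) $rsp

That prints the full 64-bit word; the *(int *) form used in the condition compares only the low 32 bits, which is enough here because the code addresses fit comfortably in 32 bits.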
Therefore I set my new breakpoint like this:

(gdb) b AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd
Breakpoint 6 at 0x849030: file aset.c, line 718.
(gdb) info b
Num     Type           Disp Enb Address            What
6       breakpoint     keep y   0x0000000000849030 in AllocSetAlloc at aset.c:718
        stop only if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd

And there we go:

Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
718     {
(gdb) bt 8
#0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
#1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
#3  0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
#4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
#5  0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>) at nodeHash.c:1663
#7  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051
(More stack frames follow...)

and on:

(gdb) info frame
Stack level 0, frame at 0x7ffcbf92fd90:
 rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e8ad
 called by frame at 0x7ffcbf92fdb0
 source language c.
 Arglist at 0x7ffcbf92fd80, args: context=0x29a6450, size=8
 Locals at 0x7ffcbf92fd80, Previous frame's sp is 0x7ffcbf92fd90
 Saved registers:
  rip at 0x7ffcbf92fd88
(gdb) b AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd && 0x84e8ad != *(int *)$rsp
Note: breakpoint 6 also set at pc 0x849030.
Breakpoint 7 at 0x849030: file aset.c, line 718.
(gdb) delete 6

Now if I continue, I don't seem to be stopping any more. Does this help now?

-Gunther