I saw your replies. If there was a way of using gdb commands to have a conditional breakpoint which only fires when the n-th caller in the chain is not a certain source location, then one could exclude the bulk of these allocations and focus better.
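For the record, gdb builds with Python support have convenience functions that can express pretty much that kind of caller test, although they match on function names rather than exact source locations. A minimal sketch, with the function name and the frame depth purely as placeholders:

(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && !$_caller_is("ExecHashJoinGetSavedTuple", 3)
(gdb) break AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && !$_any_caller_is("ExecHashJoinGetSavedTuple", 10)

$_caller_is tests only the frame exactly that many levels up, while $_any_caller_is tests every frame up to that depth.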
But I decided to try re-factoring this query instead, and I made an interesting observation.

There is a left outer join in parenthesis ... LEFT OUTER JOIN (SELECT ....) q ... the biggest parenthesis. I turned this into a temporary table, tmp_bulk, then changed the main query to ... LEFT OUTER JOIN tmp_bulk q ... and now I am running it again. But what I noticed is that the tmp_bulk table is tiny! It only has about 250 rows, so the vast majority of the rows on the left side of that join are unmatched.

The plan is all different now. Heavy CPU% load. Must be merge sorting? No memory growth, not yet.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5394 postgres  20   0 1284448 287880 271528 R  99.3  3.6   9:21.83 postgres: postgres integrator [local] EXPLAIN
 5425 postgres  20   0 1278556  93184  82296 S  27.6  1.2   0:38.72 postgres: parallel worker for PID 5394

No, I never trust it when a database job has high CPU% and low IO for a long time. So I do

SET ENABLE_MERGEJOIN TO OFF;

and run it again. Now I have high IO and low CPU%:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5394 postgres  20   0 1280904 282036 273616 D   2.3  3.6  13:01.51 postgres: postgres integrator [local] EXPLAIN
 5510 postgres  20   0 1278892  87524  80804 D   2.3  1.1   0:05.20 postgres: parallel worker for PID 5394
 5509 postgres  20   0 1278860  87484  80796 D   2.3  1.1   0:05.30 postgres: parallel worker for PID 5394

Still, I slip back into the high CPU% situation; I guess I'll have to wait it out ...

... and still waiting. No growth whatsoever. The plan is now so totally different that it probably won't trigger the problem. The original plan that causes the leak involved right joins; this one has only left joins. Even after ANALYZE tmp_bulk it still comes up with the same plan. And that plan isn't quick to finish, but it also doesn't trigger the memory leak.
So what I can tell is this: that growth to 1.5 GB is happening consistently. It isn't just happening in the beginning, with the rest being merely a follow-up problem. Also, there seems to be a final spike in growth from 1.5 GB to 2.2 GB that happens within a second. That seems very strange.

Back to the debugger, then, to do a better job with conditional breakpoints ... I already have an idea how I'll do that: I set a flag when I enter the
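A minimal sketch of that flag idea, using a gdb convenience variable (the function name here is only a stand-in):

(gdb) set $inside = 0
(gdb) break ExecHashJoinNewBatch
(gdb) commands
> silent
> set $inside = 1
> continue
> end
(gdb) break AllocSetAlloc if $inside && (int)strcmp(context->name, "ExecutorState") == 0

The flag would of course still have to be cleared again, e.g. by a second breakpoint with its own commands block placed right after the region of interest.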
Isn't 1.5 GB already way too big? There are 3 phases really.

> Anyway, the upshot is that you need to investigate what's happening
> while the memory consumption is increasing. The behavior before that
> starts to happen isn't going to be very interesting. It might be a bit
> tricky to catch that if it only takes a few seconds to blow up, but
> you could try "c 10000" or so to step through a lot of AllocSetAlloc
> calls, repeating till the bad stuff starts to happen, and then going
> back to looking at just where the calls are coming from.
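Concretely, that amounts to repeating something like the following until top shows the resident size actually climbing (the count is arbitrary):

(gdb) c 10000
(gdb) bt 8

c 10000 continues and skips the next 9,999 hits of the breakpoint we are currently stopped at, so each round gives a fresh sample of where the allocations are coming from.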
I thought that slow massive growth is already a sign of a leak?

I will now filter the calls that come through ExecHashJoinGetSavedTuple. I figured I can do this:

(gdb) info frame
Stack level 0, frame at 0x7ffcbf92fdd0:
 rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e7dd
 called by frame at 0x7ffcbf92fdf0
 source language c.
 Arglist at 0x7ffcbf92fdc0, args: context=0x29a6450, size=371
 Locals at 0x7ffcbf92fdc0, Previous frame's sp is 0x7ffcbf92fdd0
 Saved registers:
  rip at 0x7ffcbf92fdc8

So if the saved $rip is 0x84e7dd, then we are coming this way.
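In other words, at this particular stop the caller's return address happens to sit right at the top of the stack, which is what lets the breakpoint condition below peek at it through $rsp. A quick sanity check (a sketch, not taken from the actual session):

(gdb) p/x *(long *) $rsp

That prints the full 64-bit word; the *(int *) form used in the condition compares only the low 32 bits, which is enough here because the code addresses fit comfortably in 32 bits.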
Therefore I set my new breakpoint like this:

(gdb) b AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd
Breakpoint 6 at 0x849030: file aset.c, line 718.
(gdb) info b
Num     Type           Disp Enb Address            What
6       breakpoint     keep y   0x0000000000849030 in AllocSetAlloc at aset.c:718
        stop only if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd

And there we go:

Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
718     {
(gdb) bt 8
#0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
#1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
#3  0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
#4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
#5  0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>) at nodeHash.c:1663
#7  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051
(More stack frames follow...)

and on:

(gdb) info frame
Stack level 0, frame at 0x7ffcbf92fd90:
 rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e8ad
 called by frame at 0x7ffcbf92fdb0
 source language c.
 Arglist at 0x7ffcbf92fd80, args: context=0x29a6450, size=8
 Locals at 0x7ffcbf92fd80, Previous frame's sp is 0x7ffcbf92fd90
 Saved registers:
  rip at 0x7ffcbf92fd88
(gdb) b AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd && 0x84e8ad != *(int *)$rsp
Note: breakpoint 6 also set at pc 0x849030.
Breakpoint 7 at 0x849030: file aset.c, line 718.
(gdb) delete 6

Now if I continue, I don't seem to be stopping any more. Does this help now?

-Gunther