Re: Out of Memory errors are frustrating as heck!

Gunther <raj@xxxxxxxx> · Tue, 16 Apr 2019 22:24:53 -0400

    On 4/16/2019 11:30, Tom Lane wrote:

        Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
718     {
(gdb) bt 8
#0  AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718
#1  0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2  0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
#3  0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138
#4  BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
#5  0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220
#6  0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>)
     at nodeHash.c:1663
#7  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051

      Hmm ... this matches up with a vague thought I had that for some reason
the hash join might be spawning a huge number of separate batches.
Each batch would have a couple of files with associated in-memory
state including an 8K I/O buffer, so you could account for the
"slow growth" behavior you're seeing by periodic decisions to
increase the number of batches.

You might try watching calls to ExecHashIncreaseNumBatches
and see if that theory holds water.

    OK, checking that ... well yes, this breaks quickly into that,
      here is one backtrace

      This could only happen with a very unfriendly distribution of the
hash keys, I think.  There's a heuristic in there to shut off
growth of nbatch if we observe that we're making no progress at
all, but perhaps this is a skewed distribution that's not quite
skewed enough to trigger that.

    Your hunch is pretty right on. There is something very weirdly
      distributed in this particular join situation.
    #0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=slot@entry=0x2ae0238,
    hashvalue=<optimized out>) at nodeHash.c:1655
#2  0x000000000060fd9c in MultiExecPrivateHash (node=<optimized out>) at nodeHash.c:186
#3  MultiExecHash (node=node@entry=0x2ac6dc8) at nodeHash.c:114
#4  0x00000000005fe42f in MultiExecProcNode (node=node@entry=0x2ac6dc8) at execProcnode.c:501
#5  0x000000000061073d in ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:290
#6  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#7  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#8  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#9  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#10 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461
#11 0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247
#12 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#13 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461
#14 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247
#15 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,
    sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0)
    at execMain.c:1723
#16 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#17 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, es=es@entry=0x28f1048,
    queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbf930080) at explain.c:535
    But this is still in the warm-up phase, we don't know if it is at
      the place where memory grows too much.

    Let's see if I can count the occurrences ... I do cont 100. Now
      resident memory slowly grows, but not too much just 122 kB and CPU
      is at 88%. I think we haven't hit the problematic part of the
      plan.  There is a sort merge at some leaf, which I believe is
      innocent. My gut feeling from looking at CPU% high that we are in
      one of those since NL is disabled.

    Next stage is that memory shot up to 264 kB and CPU% down to
      8.6.  Heavy IO (write and read).
    Yes! And now entering the 3rd stage, where memory shots up to 600
      kB. This is where it starts "breaking out". And only now the 100
      breakpoint conts are used up. And within a second another 100. And
      even 1000 go by in a second. cont 10000 goes by in 4 seconds. And
      during that time resident memory increased to over 700 kB. Let's
      measure:
    736096 + cont 10000 --> 740056, that is 3960 bytes for 10000
      conts, or 0.396 bytes per cont. Prediction: cont 10000 will now
      arrive at 744016? Aaaand ... BINGO! 744016 exactly! cont 50000
      will take about 20 seconds and will boost memory to 763816. Bets
      are on ... drumroll ... 35, 36 , ... nope. This time didn't pan
      out. Breakpoint already hit 75727 times ignore next 5585 hits ...
      memory now 984052. So it took longer this time and memory
      increment was larger. We are now getting toward the edge of the
      cliff. Before we do here is the backtrace now:
    #0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>)
    at nodeHash.c:1655
#2  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051
#3  ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539
#4  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#5  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#6  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#7  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
(More stack frames follow...)
(gdb) bt 18
#0  ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893
#1  0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>)
    at nodeHash.c:1655
#2  0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051
#3  ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539
#4  ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565
#5  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461
#6  0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247
#7  ExecSort (pstate=0x2a1dc30) at nodeSort.c:107
#8  0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461
#9  0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247
#10 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73
#11 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461
#12 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247
#13 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0,
    sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0)
    at execMain.c:1723
#14 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364
#15 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, es=es@entry=0x28f1048,
    queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbf930080) at explain.c:535
    By the way, I ran the explain analyze of the plan while removing
      all the final result columns from the outer-most select, replacing
      them with simply SELECT 1 FROM .... And here is that plan. I am
      presenting it to you because you might glean something about the
      whatever skewed distribution.
     Hash Right Join  (cost=4203858.53..5475530.71 rows=34619 width=4) (actual time=309603.384..459480.863 rows=113478386 loops=1)
   Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text))
   ->  Hash Right Join  (cost=1341053.37..2611158.36 rows=13 width=74) (actual time=109807.980..109808.040 rows=236 loops=1)
         Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text))
         ->  Gather  (cost=30803.54..1300908.52 rows=1 width=74) (actual time=58730.915..58737.757 rows=0 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Parallel Hash Left Join  (cost=29803.54..1299908.42 rows=1 width=74) (actual time=58723.378..58723.379 rows=0 loops=3)
                     Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text)
                     ->  Parallel Hash Left Join  (cost=28118.13..1298223.00 rows=1 width=111) (actual time=58713.650..58713.652 rows=0 loops=3)
                           Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text)
                           ->  Parallel Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1268800.85 rows=1 width=111) (actual time=58544.391..58544.391 rows=0 loops=3)
                                 Filter: ((participationtypecode)::text = 'AUT'::text)
                                 Rows Removed by Filter: 2815562
                           ->  Parallel Hash  (cost=24733.28..24733.28 rows=166628 width=37) (actual time=125.611..125.611 rows=133303 loops=3)
                                 Buckets: 65536  Batches: 16  Memory Usage: 2336kB
                                 ->  Parallel Seq Scan on bestname agencyname  (cost=0.00..24733.28 rows=166628 width=37) (actual time=0.009..60.685 rows=133303 loops=3)
                     ->  Parallel Hash  (cost=1434.07..1434.07 rows=20107 width=37) (actual time=9.329..9.329 rows=11393 loops=3)
                           Buckets: 65536  Batches: 1  Memory Usage: 2976kB
                           ->  Parallel Seq Scan on entity_id agencyid  (cost=0.00..1434.07 rows=20107 width=37) (actual time=0.008..5.224 rows=11393 loops=3)
         ->  Hash  (cost=1310249.63..1310249.63 rows=13 width=111) (actual time=51077.049..51077.049 rows=236 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 41kB
               ->  Hash Right Join  (cost=829388.20..1310249.63 rows=13 width=111) (actual time=45607.852..51076.967 rows=236 loops=1)
                     Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text)
                     ->  Seq Scan on act_id an  (cost=0.00..425941.04 rows=14645404 width=37) (actual time=1.212..10883.350 rows=14676871 loops=1)
                     ->  Hash  (cost=829388.19..829388.19 rows=1 width=111) (actual time=38246.715..38246.715 rows=236 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 41kB
                           ->  Gather  (cost=381928.46..829388.19 rows=1 width=111) (actual time=31274.733..38246.640 rows=236 loops=1)
                                 Workers Planned: 2
                                 Workers Launched: 2
                                 ->  Parallel Hash Join  (cost=380928.46..828388.09 rows=1 width=111) (actual time=31347.260..38241.812 rows=79 loops=3)
                                       Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text)
                                       ->  Parallel Seq Scan on documentinformation q  (cost=0.00..447271.93 rows=50050 width=74) (actual time=13304.439..20265.733 rows=87921 loops=3)
                                             Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                             Rows Removed by Filter: 1540625
                                       ->  Parallel Hash  (cost=380928.44..380928.44 rows=1 width=74) (actual time=17954.106..17954.106 rows=79 loops=3)
                                             Buckets: 1024  Batches: 1  Memory Usage: 104kB
                                             ->  Parallel Seq Scan on actrelationship r  (cost=0.00..380928.44 rows=1 width=74) (actual time=7489.704..17953.959 rows=79 loops=3)
                                                   Filter: ((typecode)::text = 'SUBJ'::text)
                                                   Rows Removed by Filter: 3433326
   ->  Hash  (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127 loops=1)
         Buckets: 65536 (originally 65536)  Batches: 131072 (originally 2)  Memory Usage: 189207kB
         ->  Gather Merge  (cost=2845073.40..2861845.87 rows=34619 width=74) (actual time=107620.262..156256.432 rows=113478127 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Merge Left Join  (cost=2844073.37..2856849.96 rows=14425 width=74) (actual time=107570.719..126113.792 rows=37826042 loops=3)
                     Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
                     ->  Sort  (cost=1295969.26..1296005.32 rows=14425 width=111) (actual time=57700.723..58134.751 rows=231207 loops=3)
                           Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid
                           Sort Method: external merge  Disk: 26936kB
                           Worker 0:  Sort Method: external merge  Disk: 27152kB
                           Worker 1:  Sort Method: external merge  Disk: 28248kB
                           ->  Parallel Seq Scan on documentinformationsubject  (cost=0.00..1294972.76 rows=14425 width=111) (actual time=24866.656..57424.420 rows=231207 loops=3)
                                 Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                 Rows Removed by Filter: 2584355
                     ->  Materialize  (cost=1548104.12..1553157.04 rows=1010585 width=111) (actual time=49869.984..54191.701 rows=38060250 loops=3)
                           ->  Sort  (cost=1548104.12..1550630.58 rows=1010585 width=111) (actual time=49869.980..50832.205 rows=1031106 loops=3)
                                 Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
                                 Sort Method: external merge  Disk: 122192kB
                                 Worker 0:  Sort Method: external merge  Disk: 122192kB
                                 Worker 1:  Sort Method: external merge  Disk: 122192kB
                                 ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=111) (actual time=20366.166..47751.267 rows=1031106 loops=3)
                                       Filter: ((participationtypecode)::text = 'PRD'::text)
                                       Rows Removed by Filter: 7415579
 Planning Time: 2.523 ms
 Execution Time: 464825.391 ms
(66 rows)
    By the way, let me ask, do you have pretty-print functions I can
      call with, e.g., node in ExecProcNode, or pstate in ExecHashJoin?
      Because if there was, then we could determine where exactly in the
      current plan we are? And can I call the plan printer for the
      entire plan we are currently executing? Might it even give us
      preliminary counts of where in the process it is? (I ask the
      latter not only because it would be really useful for our present
      debugging, but also because it would be an awesome tool for
      monitoring of long running queries! Something I am sure tons of
      people would just love to have!
    BTW, I also read the other responses. I agree that having a swap
      space available just in case is better than these annoying out of
      memory errors. And yes, I can add that memory profiler thing, if
      you think it would actually work. I've done it with java heap
      dumps, even upgrading the VM to a 32 GB VM just to crunch the heap
      dump. But can you tell me just a little more as to how I need to
      configure this thing to get the data you want without blowing up
      the memory and disk during this huge query?

    regards,

      -Gunther