Hi guys. I don't want to be pushy, but I found it strange that
after so much lively back and forth getting to the bottom of this,
suddenly my last nights follow-up remained completely without
reply. I wonder if it even got received. For those who read their
emails with modern readers (I know I too am from a time where I
wrote everything in plain text) I marked some important questions
in bold. On 4/16/2019 11:30, Tom Lane wrote:
Breakpoint 6, AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718 718 { (gdb) bt 8 #0 AllocSetAlloc (context=0x29a6450, size=8) at aset.c:718 #1 0x000000000084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969 #2 0x0000000000702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119 #3 0x0000000000702e4c in makeBufFile (firstfile=68225) at buffile.c:138 #4 BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201 #5 0x000000000061060b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=<optimized out>, fileptr=0x6305b00) at nodeHashjoin.c:1220 #6 0x000000000060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=<optimized out>, hashvalue=<optimized out>) at nodeHash.c:1663 #7 0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051Hmm ... this matches up with a vague thought I had that for some reason the hash join might be spawning a huge number of separate batches. Each batch would have a couple of files with associated in-memory state including an 8K I/O buffer, so you could account for the "slow growth" behavior you're seeing by periodic decisions to increase the number of batches. You might try watching calls to ExecHashIncreaseNumBatches and see if that theory holds water. OK, checking that ... well yes, this breaks quickly into that, here is one backtrace #0 ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893 #1 0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=slot@entry=0x2ae0238, hashvalue=<optimized out>) at nodeHash.c:1655 #2 0x000000000060fd9c in MultiExecPrivateHash (node=<optimized out>) at nodeHash.c:186 #3 MultiExecHash (node=node@entry=0x2ac6dc8) at nodeHash.c:114 #4 0x00000000005fe42f in MultiExecProcNode (node=node@entry=0x2ac6dc8) at execProcnode.c:501 #5 0x000000000061073d in ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:290 #6 ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565 #7 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461 #8 0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247 #9 ExecSort (pstate=0x2a1dc30) at nodeSort.c:107 #10 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461 #11 0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247 #12 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73 #13 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461 #14 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247 #15 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0) at execMain.c:1723 #16 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364 #17 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, es=es@entry=0x28f1048, queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbf930080) at explain.c:535 But this is still in the warm-up phase, we don't know if it is at
the place where memory grows too much.
This could only happen with a very unfriendly distribution of the hash keys, I think. There's a heuristic in there to shut off growth of nbatch if we observe that we're making no progress at all, but perhaps this is a skewed distribution that's not quite skewed enough to trigger that. Your hunch is pretty right on. There is something very weirdly distributed in this particular join situation. Let's see if I can count the occurrences ... I do cont 100. Now resident memory slowly grows, but not too much just 122 kB and CPU is at 88%. I think we haven't hit the problematic part of the plan. There is a sort merge at some leaf, which I believe is innocent. My gut feeling from looking at CPU% high that we are in one of those since NL is disabled.Next stage is that memory shot up to 264 kB and CPU% down to 8.6. Heavy IO (write and read). Yes! And now entering the 3rd stage, where memory shots up to 600 kB. This is where it starts "breaking out". And only now the 100 breakpoint conts are used up. And within a second another 100. And even 1000 go by in a second. cont 10000 goes by in 4 seconds. And during that time resident memory increased to over 700 kB. Let's measure: 736096 + cont 10000 --> 740056, that is 3960 bytes for 10000 conts, or 0.396 bytes per cont. Prediction: cont 10000 will now arrive at 744016? Aaaand ... BINGO! 744016 exactly! cont 50000 will take about 20 seconds and will boost memory to 763816. Bets are on ... drumroll ... 35, 36 , ... nope. This time didn't pan out. Breakpoint already hit 75727 times ignore next 5585 hits ... memory now 984052. So it took longer this time and memory increment was larger. We are now getting toward the edge of the cliff. Before we do here is the backtrace now: #0 ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893 #1 0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>) at nodeHash.c:1655 #2 0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051 #3 ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539 #4 ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565 #5 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461 #6 0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247 #7 ExecSort (pstate=0x2a1dc30) at nodeSort.c:107 (More stack frames follow...) (gdb) bt 18 #0 ExecHashIncreaseNumBatches (hashtable=hashtable@entry=0x2ae8ca8) at nodeHash.c:893 #1 0x000000000060d84a in ExecHashTableInsert (hashtable=hashtable@entry=0x2ae8ca8, slot=<optimized out>, hashvalue=<optimized out>) at nodeHash.c:1655 #2 0x0000000000610c8f in ExecHashJoinNewBatch (hjstate=0x2a1dd40) at nodeHashjoin.c:1051 #3 ExecHashJoinImpl (parallel=false, pstate=0x2a1dd40) at nodeHashjoin.c:539 #4 ExecHashJoin (pstate=0x2a1dd40) at nodeHashjoin.c:565 #5 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dd40) at execProcnode.c:461 #6 0x000000000061ce6e in ExecProcNode (node=0x2a1dd40) at ../../../src/include/executor/executor.h:247 #7 ExecSort (pstate=0x2a1dc30) at nodeSort.c:107 #8 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1dc30) at execProcnode.c:461 #9 0x000000000061d2e4 in ExecProcNode (node=0x2a1dc30) at ../../../src/include/executor/executor.h:247 #10 ExecUnique (pstate=0x2a1d9b0) at nodeUnique.c:73 #11 0x00000000005fde88 in ExecProcNodeInstr (node=0x2a1d9b0) at execProcnode.c:461 #12 0x00000000005f75da in ExecProcNode (node=0x2a1d9b0) at ../../../src/include/executor/executor.h:247 #13 ExecutePlan (execute_once=<optimized out>, dest=0xcc60e0 <donothingDR>, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2a1d9b0, estate=0x2a1d6c0) at execMain.c:1723 #14 standard_ExecutorRun (queryDesc=0x2a7a478, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:364 #15 0x000000000059c718 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x2a787f8, into=into@entry=0x0, es=es@entry=0x28f1048, queryString=<optimized out>, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7ffcbf930080) at explain.c:535 I ran the explain analyze of the plan while removing all the final result columns from the outer-most select, replacing them with simply SELECT 1 FROM .... And here is that plan. I am presenting it to you because you might glean something about the whatever skewed distribution. Hash Right Join (cost=4203858.53..5475530.71 rows=34619 width=4) (actual time=309603.384..459480.863 rows=113478386 loops=1) Hash Cond: (((q.documentinternalid)::text = (documentinformationsubject.documentinternalid)::text) AND ((r.targetinternalid)::text = (documentinformationsubject.actinternalid)::text)) -> Hash Right Join (cost=1341053.37..2611158.36 rows=13 width=74) (actual time=109807.980..109808.040 rows=236 loops=1) Hash Cond: (((documentinformationsubject_2.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject_2.actinternalid)::text = (q.actinternalid)::text)) -> Gather (cost=30803.54..1300908.52 rows=1 width=74) (actual time=58730.915..58737.757 rows=0 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Hash Left Join (cost=29803.54..1299908.42 rows=1 width=74) (actual time=58723.378..58723.379 rows=0 loops=3) Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyid.entityinternalid)::text) -> Parallel Hash Left Join (cost=28118.13..1298223.00 rows=1 width=111) (actual time=58713.650..58713.652 rows=0 loops=3) Hash Cond: ((documentinformationsubject_2.otherentityinternalid)::text = (agencyname.entityinternalid)::text) -> Parallel Seq Scan on documentinformationsubject documentinformationsubject_2 (cost=0.00..1268800.85 rows=1 width=111) (actual time=58544.391..58544.391 rows=0 loops=3) Filter: ((participationtypecode)::text = 'AUT'::text) Rows Removed by Filter: 2815562 -> Parallel Hash (cost=24733.28..24733.28 rows=166628 width=37) (actual time=125.611..125.611 rows=133303 loops=3) Buckets: 65536 Batches: 16 Memory Usage: 2336kB -> Parallel Seq Scan on bestname agencyname (cost=0.00..24733.28 rows=166628 width=37) (actual time=0.009..60.685 rows=133303 loops=3) -> Parallel Hash (cost=1434.07..1434.07 rows=20107 width=37) (actual time=9.329..9.329 rows=11393 loops=3) Buckets: 65536 Batches: 1 Memory Usage: 2976kB -> Parallel Seq Scan on entity_id agencyid (cost=0.00..1434.07 rows=20107 width=37) (actual time=0.008..5.224 rows=11393 loops=3) -> Hash (cost=1310249.63..1310249.63 rows=13 width=111) (actual time=51077.049..51077.049 rows=236 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 41kB -> Hash Right Join (cost=829388.20..1310249.63 rows=13 width=111) (actual time=45607.852..51076.967 rows=236 loops=1) Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text) -> Seq Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=37) (actual time=1.212..10883.350 rows=14676871 loops=1) -> Hash (cost=829388.19..829388.19 rows=1 width=111) (actual time=38246.715..38246.715 rows=236 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 41kB -> Gather (cost=381928.46..829388.19 rows=1 width=111) (actual time=31274.733..38246.640 rows=236 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Hash Join (cost=380928.46..828388.09 rows=1 width=111) (actual time=31347.260..38241.812 rows=79 loops=3) Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text) -> Parallel Seq Scan on documentinformation q (cost=0.00..447271.93 rows=50050 width=74) (actual time=13304.439..20265.733 rows=87921 loops=3) Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text)) Rows Removed by Filter: 1540625 -> Parallel Hash (cost=380928.44..380928.44 rows=1 width=74) (actual time=17954.106..17954.106 rows=79 loops=3) Buckets: 1024 Batches: 1 Memory Usage: 104kB -> Parallel Seq Scan on actrelationship r (cost=0.00..380928.44 rows=1 width=74) (actual time=7489.704..17953.959 rows=79 loops=3) Filter: ((typecode)::text = 'SUBJ'::text) Rows Removed by Filter: 3433326 -> Hash (cost=2861845.87..2861845.87 rows=34619 width=74) (actual time=199792.446..199792.446 rows=113478127 loops=1) Buckets: 65536 (originally 65536) Batches: 131072 (originally 2) Memory Usage: 189207kB -> Gather Merge (cost=2845073.40..2861845.87 rows=34619 width=74) (actual time=107620.262..156256.432 rows=113478127 loops=1) Workers Planned: 2 Workers Launched: 2 -> Merge Left Join (cost=2844073.37..2856849.96 rows=14425 width=74) (actual time=107570.719..126113.792 rows=37826042 loops=3) Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text)) -> Sort (cost=1295969.26..1296005.32 rows=14425 width=111) (actual time=57700.723..58134.751 rows=231207 loops=3) Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid Sort Method: external merge Disk: 26936kB Worker 0: Sort Method: external merge Disk: 27152kB Worker 1: Sort Method: external merge Disk: 28248kB -> Parallel Seq Scan on documentinformationsubject (cost=0.00..1294972.76 rows=14425 width=111) (actual time=24866.656..57424.420 rows=231207 loops=3) Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text)) Rows Removed by Filter: 2584355 -> Materialize (cost=1548104.12..1553157.04 rows=1010585 width=111) (actual time=49869.984..54191.701 rows=38060250 loops=3) -> Sort (cost=1548104.12..1550630.58 rows=1010585 width=111) (actual time=49869.980..50832.205 rows=1031106 loops=3) Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid Sort Method: external merge Disk: 122192kB Worker 0: Sort Method: external merge Disk: 122192kB Worker 1: Sort Method: external merge Disk: 122192kB -> Seq Scan on documentinformationsubject documentinformationsubject_1 (cost=0.00..1329868.64 rows=1010585 width=111) (actual time=20366.166..47751.267 rows=1031106 loops=3) Filter: ((participationtypecode)::text = 'PRD'::text) Rows Removed by Filter: 7415579 Planning Time: 2.523 ms Execution Time: 464825.391 ms (66 rows) By the way, let me ask, do you have pretty-print functions I can call with, e.g., node in ExecProcNode, or pstate in ExecHashJoin? Because if there was, then we could determine where exactly in the current plan we are? And can I call the plan printer for the entire plan we are currently executing? Might it even give us preliminary counts of where in the process it is? (I ask the latter not only because it would be really useful for our present debugging, but also because it would be an awesome tool for monitoring of long running queries! Something I am sure tons of people would just love to have!) I also read the other responses. I agree that having a swap space
available just in case is better than these annoying out of memory
errors. And yes, I can add that memory profiler thing, if you
think it would actually work. I've done it with java heap dumps,
even upgrading the VM to a 32 GB VM just to crunch the heap dump.
But can you tell me just a little more as to how I need to
configure this thing to get the data you want without blowing up
the memory and disk during this huge query? regards, |