Now to Justin's patch. First undo Tomas' patch and apply:

$ mv src/include/executor/hashjoin.h.orig src/include/executor/hashjoin.h
$ mv src/backend/executor/nodeHash.c.orig src/backend/executor/nodeHash.c
$ mv src/backend/executor/nodeHashjoin.c.orig src/backend/executor/nodeHashjoin.c
$ patch -p1 <../limit-hash-nbatches-v2.patch
patching file src/backend/executor/nodeHash.c
Hunk #1 succeeded at 570 (offset -3 lines).
Hunk #2 succeeded at 917 (offset -3 lines).
Hunk #3 succeeded at 930 (offset -3 lines).
Hunk #4 succeeded at 1037 (offset -3 lines).
Hunk #5 succeeded at 1658 (offset -4 lines).
$ make
$ make install
$ pg_ctl -c restart

and go ... lots of CPU% again and very limited memory use so far:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1302772  90316  58004 R  94.4  1.1   4:38.05 postgres: postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 R  97.7  0.9   2:03.54 postgres: parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 S   0.0  0.9   2:08.28 postgres: parallel worker for PID 11054

That's a pretty decent sign so far. Slight increase ... but still relatively steady ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1379704 167140  58004 R  95.0  2.1   5:56.28 postgres: postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 S  25.6  0.9   2:36.59 postgres: parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 R  61.8  0.9   2:29.65 postgres: parallel worker for PID 11054

aaand break out ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11119 postgres  20   0 1271660   1.0g   1.0g D   0.0 13.4   0:03.10 postgres: parallel worker for PID 11054
11054 postgres  20   0 1380940   1.0g 950508 D   0.0 13.4   6:56.09 postgres: postgres integrator [local] EXPLAIN
11118 postgres  20   0 1271660 884540 882724 D   0.0 11.2   0:02.84 postgres: parallel worker for PID 11054

and crash:

foo=# explain analyze select * from reports.v_BusinessOperation;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

What happened? Ouch, no space left on the root device. Too much logging? Maybe the core dump ...
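As an aside on the step-wise RES growth: a hash join that exceeds its memory budget doubles nbatch and spills tuples that no longer belong to the current batch. Here is a toy model of that (my own simplification, not PostgreSQL's actual code; batch assignment is reduced to hash mod nbatch). It shows why doubling frees roughly half the in-memory tuples when hashes are well distributed, but frees nothing when one key value dominates, so the doubling just repeats without helping:

```python
import random

def grow_batches(hashes, nbatch, curbatch=0):
    """Double nbatch and evict tuples whose batch number changes.
    A toy stand-in for hash join batch splitting: tuples staying in
    the current batch are kept in memory, the rest are 'freed'."""
    nbatch *= 2
    kept = [h for h in hashes if h % nbatch == curbatch]
    freed = len(hashes) - len(kept)
    return kept, nbatch, freed

random.seed(1)
nbatch = 16
# In-memory tuples of batch 0 under the current nbatch:
uniform = [h for h in (random.getrandbits(32) for _ in range(200000))
           if h % nbatch == 0]
# Extreme skew: every tuple has the same hash value (all in batch 0).
skewed = [0] * len(uniform)

u_kept, _, u_freed = grow_batches(uniform, nbatch)
s_kept, _, s_freed = grow_batches(skewed, nbatch)
print(f"uniform: freed {u_freed} of {len(uniform)} tuples")  # roughly half
print(f"skewed:  freed {s_freed} of {len(skewed)} tuples")   # nothing freed
```

Under this model, a log line that frees only a tiny fraction of tuples after a doubling is the signature of duplicate-heavy keys: no matter how often nbatch doubles, the tuples all land in the same batch again.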
Log file content is simple:

2019-04-22 00:07:56.104 UTC [11048] LOG:  database system was shut down at 2019-04-22 00:07:55 UTC
2019-04-22 00:07:56.108 UTC [11046] LOG:  database system is ready to accept connections
Hashjoin 0x2122458: initial nbatch = 16, nbuckets = 8192
Hashjoin 0x2122458: increasing nbatch to 32 because space = 4128933
Hashjoin 0x2122458: freed 148 of 10584 tuples, space now 4071106
Hashjoin 0x2122458: increasing nbatch to 64 because space = 4128826
Hashjoin 0x2122458: freed 544 of 10584 tuples, space now 3916296
Hashjoin 0x2122458: increasing nbatch to 128 because space = 4128846
Hashjoin 0x2122458: freed 10419 of 10585 tuples, space now 65570
Hashjoin 0x2122458: increasing nbatch to 256 because space = 4128829
Hashjoin 0x2122458: freed 10308 of 10734 tuples, space now 161815
Hashjoin 0x2122458: increasing nbatch to 512 because space = 4128908
Hashjoin 0x2122458: freed 398 of 10379 tuples, space now 3977787
Hashjoin 0x3ac9918: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac91a8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac93c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41018: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41048: initial nbatch = 1, nbuckets = 1024
2019-04-22 00:16:55.273 UTC [11046] LOG:  background worker "parallel worker" (PID 11119) was terminated by signal 11: Segmentation fault
2019-04-22 00:16:55.273 UTC [11046] DETAIL:  Failed process was running: explain analyze select * from reports.v_BusinessOperation;
2019-04-22 00:16:55.273 UTC [11046] LOG:  terminating any other active server processes
2019-04-22 00:16:55.274 UTC [11058] WARNING:  terminating connection because of crash of another server process
2019-04-22 00:16:55.274 UTC [11058] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-04-22 00:16:55.274 UTC [11058] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-04-22 00:16:55.277 UTC [11052] LOG:  could not write temporary statistics file "pg_stat/db_16903.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG:  could not close temporary statistics file "pg_stat/db_0.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG:  could not close temporary statistics file "pg_stat/global.tmp": No space left on device
2019-04-22 00:16:55.315 UTC [11046] LOG:  all server processes terminated; reinitializing
2019-04-22 00:16:55.425 UTC [11123] LOG:  database system was interrupted; last known up at 2019-04-22 00:12:56 UTC
2019-04-22 00:16:55.426 UTC [11124] FATAL:  the database system is in recovery mode
2019-04-22 00:16:55.545 UTC [11123] LOG:  database system was not properly shut down; automatic recovery in progress
2019-04-22 00:16:55.549 UTC [11123] LOG:  redo starts at 3D2/C44FDCF8

OK, it is all because it dumped 3 core dumps; glad that I captured
the top lines of the backend and its 2 workers:

-rw------- 1 postgres postgres 1075843072 Apr 22 00:16 core.11054  -- backend

And the meltdown starts with "parallel worker" (PID 11119) receiving SIGSEGV. So let's get gdb to the task to see what's up:

$ gdb -c data/core.11119 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11119 is truncated: expected core file size >= 1127112704, found: 1079103488.
[New LWP 11119]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
1127            {
(gdb) bt 8
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) info frame
Stack level 0, frame at 0x7ffd45fa9c90:
 rip = 0x6bd792 in ExecParallelHashJoinNewBatch (nodeHashjoin.c:1127); saved rip = <not saved>
 Outermost frame: Cannot access memory at address 0x7ffd45fa9c88
 source language c.
 Arglist at 0x7ffd45fa9c80, args: hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>
 Locals at 0x7ffd45fa9c80, Previous frame's sp is 0x7ffd45fa9c90
Cannot access memory at address 0x7ffd45fa9c80
(gdb) list
1122            SharedTuplestoreAccessor *inner_tuples;
1123            Barrier    *batch_barrier =
1124                    &hashtable->batches[batchno].shared->batch_barrier;
1125
1126            switch (BarrierAttach(batch_barrier))
1127            {
1128                    case PHJ_BATCH_ELECTING:
1129
1130                            /* One backend allocates the hash table. */
1131                            if (BarrierArriveAndWait(batch_barrier,

Unfortunately this core file is truncated because the file system ran out of space. Let's see the others.
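For context on the crash site: the switch on BarrierAttach() is where a process joins the per-batch barrier, and BarrierArriveAndWait() elects exactly one participant to do the setup work (allocating the shared hash table) while the others wait. A rough single-process analogy using Python's threading.Barrier (my illustration only; PostgreSQL's Barrier additionally supports dynamic attach/detach, which a fixed-size threading.Barrier does not model):

```python
import threading

N_WORKERS = 3
batch_barrier = threading.Barrier(N_WORKERS)
elected = []

def worker(i):
    # Analogous to BarrierArriveAndWait(): every participant blocks
    # until all have arrived; wait() returns a unique index per
    # thread, and the one that gets 0 plays the "elected" role
    # (like PHJ_BATCH_ELECTING allocating the hash table).
    if batch_barrier.wait() == 0:
        elected.append(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("elected worker:", elected)  # exactly one worker is elected
```

The point of the election is that setup runs once, not once per worker; everyone else resumes only after the elected participant has passed the barrier phase.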
$ gdb -c data/core.11118 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11118 is truncated: expected core file size >= 1127112704, found: 894640128.
[New LWP 11118]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
1127            {
(gdb) bt 5
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) list
1122            SharedTuplestoreAccessor *inner_tuples;
1123            Barrier    *batch_barrier =
1124                    &hashtable->batches[batchno].shared->batch_barrier;
1125
1126            switch (BarrierAttach(batch_barrier))
1127            {
1128                    case PHJ_BATCH_ELECTING:
1129
1130                            /* One backend allocates the hash table. */
1131                            if (BarrierArriveAndWait(batch_barrier,

Strange, that one must have died very similarly: same place, also truncated.

$ gdb -c data/core.11054 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11054 is truncated: expected core file size >= 1238786048, found: 1075843072.
[New LWP 11054]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: postgres integrator'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9498>)
    at nodeHashjoin.c:1127
1127            {

I don't understand why all of them are at the same location; it doesn't make any sense to me. But I'll leave it at that for now.

-Gunther