Now to Justin's patch. First undo Tomas' patch and apply:

$ mv src/include/executor/hashjoin.h.orig src/include/executor/hashjoin.h
$ mv src/backend/executor/nodeHash.c.orig src/backend/executor/nodeHash.c
$ mv src/backend/executor/nodeHashjoin.c.orig src/backend/executor/nodeHashjoin.c
$ patch -p1 <../limit-hash-nbatches-v2.patch
patching file src/backend/executor/nodeHash.c
Hunk #1 succeeded at 570 (offset -3 lines).
Hunk #2 succeeded at 917 (offset -3 lines).
Hunk #3 succeeded at 930 (offset -3 lines).
Hunk #4 succeeded at 1037 (offset -3 lines).
Hunk #5 succeeded at 1658 (offset -4 lines).
$ make
$ make install
$ pg_ctl -c restart

and go ... lots of CPU% again and very limited memory use so far:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1302772  90316  58004 R  94.4  1.1   4:38.05 postgres: postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 R  97.7  0.9   2:03.54 postgres: parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 S   0.0  0.9   2:08.28 postgres: parallel worker for PID 11054

That's a pretty decent sign so far. Slight increase ... but still relatively steady ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11054 postgres  20   0 1379704 167140  58004 R  95.0  2.1   5:56.28 postgres: postgres integrator [local] EXPLAIN
11055 postgres  20   0 1280532  68076  57168 S  25.6  0.9   2:36.59 postgres: parallel worker for PID 11054
11056 postgres  20   0 1280532  67964  57124 R  61.8  0.9   2:29.65 postgres: parallel worker for PID 11054

aaand break out ...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11119 postgres  20   0 1271660   1.0g   1.0g D   0.0 13.4   0:03.10 postgres: parallel worker for PID 11054
11054 postgres  20   0 1380940   1.0g 950508 D   0.0 13.4   6:56.09 postgres: postgres integrator [local] EXPLAIN
11118 postgres  20   0 1271660 884540 882724 D   0.0 11.2   0:02.84 postgres: parallel worker for PID 11054

and crash:

foo=# explain analyze select * from reports.v_BusinessOperation;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

What happened? Ouch, no space left on the root device. Too much logging? Maybe the core dump ...
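As an aside on the step-wise RES growth: a hash join that exceeds its memory budget doubles nbatch and spills tuples that no longer belong to the current batch. Here is a toy model of that (my own simplification, not PostgreSQL's actual code; batch assignment is reduced to hash mod nbatch). It shows why doubling frees roughly half the in-memory tuples when hashes are well distributed, but frees nothing when one key value dominates, so the doubling just repeats without helping:

```python
import random

def grow_batches(hashes, nbatch, curbatch=0):
    """Double nbatch and evict tuples whose batch number changes.
    A toy stand-in for hash join batch splitting: tuples staying in
    the current batch are kept in memory, the rest are 'freed'."""
    nbatch *= 2
    kept = [h for h in hashes if h % nbatch == curbatch]
    freed = len(hashes) - len(kept)
    return kept, nbatch, freed

random.seed(1)
nbatch = 16
# In-memory tuples of batch 0 under the current nbatch:
uniform = [h for h in (random.getrandbits(32) for _ in range(200000))
           if h % nbatch == 0]
# Extreme skew: every tuple has the same hash value (all in batch 0).
skewed = [0] * len(uniform)

u_kept, _, u_freed = grow_batches(uniform, nbatch)
s_kept, _, s_freed = grow_batches(skewed, nbatch)
print(f"uniform: freed {u_freed} of {len(uniform)} tuples")  # roughly half
print(f"skewed:  freed {s_freed} of {len(skewed)} tuples")   # nothing freed
```

Under this model, a log line that frees only a tiny fraction of tuples after a doubling is the signature of duplicate-heavy keys: no matter how often nbatch doubles, the tuples all land in the same batch again.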
Log file content is simple:

2019-04-22 00:07:56.104 UTC [11048] LOG:  database system was shut down at 2019-04-22 00:07:55 UTC
2019-04-22 00:07:56.108 UTC [11046] LOG:  database system is ready to accept connections
Hashjoin 0x2122458: initial nbatch = 16, nbuckets = 8192
Hashjoin 0x2122458: increasing nbatch to 32 because space = 4128933
Hashjoin 0x2122458: freed 148 of 10584 tuples, space now 4071106
Hashjoin 0x2122458: increasing nbatch to 64 because space = 4128826
Hashjoin 0x2122458: freed 544 of 10584 tuples, space now 3916296
Hashjoin 0x2122458: increasing nbatch to 128 because space = 4128846
Hashjoin 0x2122458: freed 10419 of 10585 tuples, space now 65570
Hashjoin 0x2122458: increasing nbatch to 256 because space = 4128829
Hashjoin 0x2122458: freed 10308 of 10734 tuples, space now 161815
Hashjoin 0x2122458: increasing nbatch to 512 because space = 4128908
Hashjoin 0x2122458: freed 398 of 10379 tuples, space now 3977787
Hashjoin 0x3ac9918: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac91a8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x3ac93c8: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41018: initial nbatch = 1, nbuckets = 1024
Hashjoin 0x1f41048: initial nbatch = 1, nbuckets = 1024
2019-04-22 00:16:55.273 UTC [11046] LOG:  background worker "parallel worker" (PID 11119) was terminated by signal 11: Segmentation fault
2019-04-22 00:16:55.273 UTC [11046] DETAIL:  Failed process was running: explain analyze select * from reports.v_BusinessOperation;
2019-04-22 00:16:55.273 UTC [11046] LOG:  terminating any other active server processes
2019-04-22 00:16:55.274 UTC [11058] WARNING:  terminating connection because of crash of another server process
2019-04-22 00:16:55.274 UTC [11058] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2019-04-22 00:16:55.274 UTC [11058] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2019-04-22 00:16:55.277 UTC [11052] LOG:  could not write temporary statistics file "pg_stat/db_16903.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG:  could not close temporary statistics file "pg_stat/db_0.tmp": No space left on device
2019-04-22 00:16:55.278 UTC [11052] LOG:  could not close temporary statistics file "pg_stat/global.tmp": No space left on device
2019-04-22 00:16:55.315 UTC [11046] LOG:  all server processes terminated; reinitializing
2019-04-22 00:16:55.425 UTC [11123] LOG:  database system was interrupted; last known up at 2019-04-22 00:12:56 UTC
2019-04-22 00:16:55.426 UTC [11124] FATAL:  the database system is in recovery mode
2019-04-22 00:16:55.545 UTC [11123] LOG:  database system was not properly shut down; automatic recovery in progress
2019-04-22 00:16:55.549 UTC [11123] LOG:  redo starts at 3D2/C44FDCF8

OK, it is all because it dumped 3 core dumps; glad that I captured
the top lines of the backend and its 2 workers:

-rw------- 1 postgres postgres 1075843072 Apr 22 00:16 core.11054  -- backend

And the meltdown starts with "parallel worker" (PID 11119) receiving SIGSEGV. So let's get gdb to the task to see what's up:

$ gdb -c data/core.11119 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11119 is truncated: expected core file size >= 1127112704, found: 1079103488.
[New LWP 11119]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
1127            {
(gdb) bt 8
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) info frame
Stack level 0, frame at 0x7ffd45fa9c90:
 rip = 0x6bd792 in ExecParallelHashJoinNewBatch (nodeHashjoin.c:1127); saved rip = <not saved>
 Outermost frame: Cannot access memory at address 0x7ffd45fa9c88
 source language c.
 Arglist at 0x7ffd45fa9c80, args: hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>
 Locals at 0x7ffd45fa9c80, Previous frame's sp is 0x7ffd45fa9c90
Cannot access memory at address 0x7ffd45fa9c80
(gdb) list
1122            SharedTuplestoreAccessor *inner_tuples;
1123            Barrier    *batch_barrier =
1124                    &hashtable->batches[batchno].shared->batch_barrier;
1125
1126            switch (BarrierAttach(batch_barrier))
1127            {
1128                    case PHJ_BATCH_ELECTING:
1129
1130                            /* One backend allocates the hash table. */
1131                            if (BarrierArriveAndWait(batch_barrier,

Unfortunately this core file is truncated because the file system ran out of space. Let's see the others.
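For context on the crash site: the switch on BarrierAttach() is where a process joins the per-batch barrier, and BarrierArriveAndWait() elects exactly one participant to do the setup work (allocating the shared hash table) while the others wait. A rough single-process analogy using Python's threading.Barrier (my illustration only; PostgreSQL's Barrier additionally supports dynamic attach/detach, which a fixed-size threading.Barrier does not model):

```python
import threading

N_WORKERS = 3
batch_barrier = threading.Barrier(N_WORKERS)
elected = []

def worker(i):
    # Analogous to BarrierArriveAndWait(): every participant blocks
    # until all have arrived; wait() returns a unique index per
    # thread, and the one that gets 0 plays the "elected" role
    # (like PHJ_BATCH_ELECTING allocating the hash table).
    if batch_barrier.wait() == 0:
        elected.append(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("elected worker:", elected)  # exactly one worker is elected
```

The point of the election is that setup runs once, not once per worker; everyone else resumes only after the elected participant has passed the barrier phase.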
$ gdb -c data/core.11118 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11118 is truncated: expected core file size >= 1127112704, found: 894640128.
[New LWP 11118]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: parallel worker for'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
1127            {
(gdb) bt 5
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9c38>)
    at nodeHashjoin.c:1127
Backtrace stopped: Cannot access memory at address 0x7ffd45fa9c88
(gdb) list
1122            SharedTuplestoreAccessor *inner_tuples;
1123            Barrier    *batch_barrier =
1124                    &hashtable->batches[batchno].shared->batch_barrier;
1125
1126            switch (BarrierAttach(batch_barrier))
1127            {
1128                    case PHJ_BATCH_ELECTING:
1129
1130                            /* One backend allocates the hash table. */
1131                            if (BarrierArriveAndWait(batch_barrier,

Strange, that one must have died very similarly: same place, also truncated.

$ gdb -c data/core.11054 postgresql-11.2/src/backend/postgres
...
Reading symbols from postgresql-11.2/src/backend/postgres...done.
BFD: Warning: /var/lib/pgsql/data/core.11054 is truncated: expected core file size >= 1238786048, found: 1075843072.
[New LWP 11054]
Cannot access memory at address 0x7ff8d25dc108
Cannot access memory at address 0x7ff8d25dc100
Failed to read a valid object file image from memory.
Core was generated by `postgres: postgres integrator'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000006bd792 in ExecParallelHashJoinNewBatch (
    hjstate=<error reading variable: Cannot access memory at address 0x7ffd45fa9498>)
    at nodeHashjoin.c:1127
1127            {

I don't understand why all of them are at the same location; it doesn't make any sense to me. But I'll leave it at that for now.

-Gunther