What to do when dynamic shared memory control segment is corrupt

Sherrylyn Branchaw <sbranchaw@xxxxxxxxx> · Mon, 18 Jun 2018 11:34:11 -0400

Greetings,

We are using Postgres 9.6.8 (planning to upgrade to 9.6.9 soon) on RHEL 6.9.

We recently experienced two similar outages on two different prod databases. The error messages from the logs were as follows:

LOG:  server process (PID 138529) was terminated by signal 6: Aborted
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

We are still investigating what may have triggered these errors, since there were no recent changes to these databases. Unfortunately, core dumps were not configured correctly, so we may have to wait for the next outage before we can do a good root cause analysis.

My question, meanwhile, is around remedial actions to take when this happens.

In one case, the logs recorded

LOG:  all server processes terminated; reinitializing
LOG:  incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 365/FDFA738
LOG:  invalid record length at 365/12420978: wanted 24, got 0
LOG:  redo done at 365/12420950
LOG:  last completed transaction was at log time 2018-06-05 10:59:27.049443-05
LOG:  checkpoint starting: end-of-recovery immediate
LOG:  checkpoint complete: wrote 5343 buffers (0.5%); 0 transaction log file(s) added, 1 removed, 0 recycled; write=0.131 s, sync=0.009 s, total=0.164 s; sync files=142, longest=0.005 s, average=0.000 s; distance=39064 kB, estimate=39064 kB
LOG:  MultiXact member wraparound protections are now enabled
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections

In that case, the database restarted immediately, with only 30 seconds of downtime.

In the other case, the logs recorded

LOG:  all server processes terminated; reinitializing
LOG:  dynamic shared memory control segment is corrupt
LOG:  incomplete data in "postmaster.pid": found only 1 newlines while trying to add line 7

In that case, the database did not restart on its own. It was 5 am on Sunday, so the on-call SRE just manually started the database up, and it appears to have been running fine since.

My question is whether the corrupt shared memory control segment, and the failure of Postgres to automatically restart, mean the database should not be automatically started up, and if there's something we should be doing before restarting. 

Do we potentially have corrupt data or indices as a result of our last outage? If so, what should we do to investigate?

Thanks,
Sherrylyn