Hello,
we ran out of disk space on our production primary database, on the file system dedicated to WALs (one of our logical replicas died, so WALs accumulated).
As expected, the primary instance shut down:
Primary instance:
PostgreSQL 11.3 (Ubuntu 11.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0, 64-bit
2019-09-29 19:24:28 UTC 2241 5d36488f.8c1 1 0 [53100]:PANIC: could not write to file "pg_wal/xlogtemp.2241": No space left on device
2019-09-29 19:24:30 UTC 1706 5d364871.6aa 3387 0 [00000]:LOG: WAL writer process (PID 2241) was terminated by signal 6: Aborted
2019-09-29 19:24:30 UTC 1706 5d364871.6aa 3388 0 [00000]:LOG: terminating any other active server processes
2019-09-29 19:24:30 UTC 1706 5d364871.6aa 3390 0 [00000]:LOG: all server processes terminated; reinitializing
2019-09-29 19:24:30 UTC 21499 5d9104ee.53fb 1 0 [00000]:LOG: database system was interrupted; last known up at 2019-09-29 19:23:47 UTC
2019-09-29 19:24:33 UTC 21499 5d9104ee.53fb 2 0 [00000]:LOG: recovered replication state of node 1 to CF3/442D0758
2019-09-29 19:24:33 UTC 21499 5d9104ee.53fb 3 0 [00000]:LOG: database system was not properly shut down; automatic recovery in progress
2019-09-29 19:24:33 UTC 21499 5d9104ee.53fb 4 0 [00000]:LOG: redo starts at FE9/A49830D8
2019-09-29 19:25:55 UTC 21499 5d9104ee.53fb 5 0 [00000]:LOG: redo done at FED/38FFC540
2019-09-29 19:25:55 UTC 21499 5d9104ee.53fb 6 0 [00000]:LOG: last completed transaction was at log time 2019-09-29 19:22:06.597333+00
2019-09-29 19:25:55 UTC 21499 5d9104ee.53fb 7 0 [00000]:LOG: checkpoint starting: end-of-recovery immediate
2019-09-29 19:26:59 UTC 21499 5d9104ee.53fb 8 0 [00000]:LOG: checkpoint complete: wrote 1046940 buffers (99.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=64.166 s, sync=0.008 s, total=64.212 s; sync files=184, longest=0.001 s, average=0.000 s; distance=15014380 kB, estimate=15014380 kB
2019-09-29 19:26:59 UTC 21499 5d9104ee.53fb 9 0 [53100]:FATAL: could not write to file "pg_wal/xlogtemp.21499": No space left on device
2019-09-29 19:26:59 UTC 1706 5d364871.6aa 3391 0 [00000]:LOG: startup process (PID 21499) exited with exit code 1
2019-09-29 19:26:59 UTC 1706 5d364871.6aa 3392 0 [00000]:LOG: aborting startup due to startup process failure
2019-09-29 19:26:59 UTC 1706 5d364871.6aa 3393 0 [00000]:LOG: database system is shut down
Expected behavior: free space on the WAL file system was only 14 MB, not enough for the next WAL segment to be written.
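For reference, a quick sketch of how the mismatch can be confirmed; the mount point and data directory below are assumptions, adjust to your layout:
# free space on the file system holding pg_wal
df -h /pgsql/pg_wal
# configured WAL segment size (16 MB by default), which no longer fits into 14 MB
pg_controldata /path/to/pgdata | grep 'Bytes per WAL segment'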
Once the failed logical replica was back up and running, we extended the file system for WALs and started the primary instance:
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 1 0 [00000]:LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 2 0 [00000]:LOG: pg_stat_kcache.linux_hz is set to 1000000
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 3 0 [00000]:LOG: listening on IPv4 address "0.0.0.0", port 5432
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 4 0 [00000]:LOG: listening on IPv6 address "::", port 5432
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 5 0 [00000]:LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2019-09-30 08:21:44 UTC 13637 5d91bb18.3545 1 0 [00000]:LOG: database system was shut down at 2019-09-29 19:26:59 UTC
2019-09-30 08:21:44 UTC 13637 5d91bb18.3545 2 0 [00000]:LOG: recovered replication state of node 1 to CF3/442D0758
2019-09-30 08:21:44 UTC 13646 5d91bb18.354e 1 0 [00000]:LOG: pg_partman master background worker master process initialized with role postgres
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 6 0 [00000]:LOG: database system is ready to accept connections
2019-09-30 08:21:44 UTC 13645 5d91bb18.354d 1 0 2/0 [00000]:LOG: pg_cron scheduler started
2019-09-30 08:21:44 UTC 13647 5d91bb18.354f 1 0 3/0 [00000]:LOG: POWA connected to database powa
So the primary instance started, all logical replication streams connected, and as soon as logical replication for the failed subscriber processed the WALs, the unnecessary WALs were deleted as expected. Up to this point everything worked like a charm from the primary instance's point of view.
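As an aside, the WAL retained because of a dead subscriber can be watched on the primary through pg_replication_slots; a minimal sketch of such a check:
# how much WAL each replication slot still holds back (run on the primary)
psql -X -c "
  SELECT slot_name, active,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots;"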
From this primary instance we are running a WAL-shipping replica (pgBackRest is used), and this is where the issues started.
WAL recovery replica:
PostgreSQL 11.3 (Ubuntu 11.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0, 64-bit
WAL files were restored from the backup repository up to the last one archived before the primary instance shut down.
2019-09-29 19:24:31.390 P00 INFO: found 0000000100000FED00000037 in the archive
2019-09-29 19:24:31.391 P00 INFO: archive-get command end: completed successfully (43ms)
2019-09-29 19:24:31 UTC 3062 5d76145f.bf6 181012 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000037" from archive
2019-09-29 19:24:31.919 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000038, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-29 19:24:31.958 P00 INFO: found 0000000100000FED00000038 in the archive
2019-09-29 19:24:31.959 P00 INFO: archive-get command end: completed successfully (41ms)
2019-09-29 19:24:31 UTC 3062 5d76145f.bf6 181013 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000038" from archive
2019-09-29 19:24:32.505 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-29 19:24:32.505 P00 INFO: unable to find 0000000100000FED00000039 in the archive
2019-09-29 19:24:32.505 P00 INFO: archive-get command end: completed successfully (2ms)
2019-09-29 19:24:34.247 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-29 19:24:34.448 P00 INFO: unable to find 0000000100000FED00000039 in the archive
2019-09-29 19:24:34.448 P00 INFO: archive-get command end: completed successfully (204ms)
So the replica instance was painfully waiting, and waiting, and waiting for the next WAL segment, 0000000100000FED00000039.
And yes, after the long outage pgBackRest found and restored WAL 0000000100000FED00000039:
2019-09-30 08:22:03.673 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:03.874 P00 INFO: unable to find 0000000100000FED00000039 in the archive
2019-09-30 08:22:03.875 P00 INFO: archive-get command end: completed successfully (203ms)
2019-09-30 08:22:08.678 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:09.121 P00 INFO: found 0000000100000FED00000039 in the archive
2019-09-30 08:22:09.123 P00 INFO: archive-get command end: completed successfully (446ms)
2019-09-30 08:22:09 UTC 3062 5d76145f.bf6 181014 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000039" from archive
2019-09-30 08:22:09 UTC 3062 5d76145f.bf6 181015 0 1/0 [00000]:LOG: invalid contrecord length 1956 at FED/38FFE208
2019-09-30 08:22:13.683 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000038, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:14.132 P00 INFO: found 0000000100000FED00000038 in the archive
2019-09-30 08:22:14.135 P00 INFO: archive-get command end: completed successfully (453ms)
2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181016 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000038" from archive
2019-09-30 08:22:14.195 P00 INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:14.236 P00 INFO: found 0000000100000FED00000039 in the archive
2019-09-30 08:22:14.238 P00 INFO: archive-get command end: completed successfully (44ms)
2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181017 0 1/0 [00000]:LOG: restored log file "0000000100000FED00000039" from archive
2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181018 0 1/0 [00000]:LOG: invalid contrecord length 1956 at FED/38FFE208
But recovery on the replica failed to process WAL file 0000000100000FED00000039, with the log message: "invalid contrecord length 1956 at FED/38FFE208".
Since WAL segment 0000000100000FED00000039 still exists on the primary instance, I copied it to the WAL replica, but it was identical to the file restored by pgBackRest, so I assume the WAL backup/restore was not the root cause of this issue.
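The comparison itself was nothing fancy, roughly the following; the file names are illustrative:
# byte-for-byte comparison of the copy taken from the primary's pg_wal
# and the segment restored by pgBackRest
cmp 0000000100000FED00000039.from_primary 0000000100000FED00000039.from_pgbackrest \
  && echo "segments are identical"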
The recovery.conf on the replica instance is simple:
standby_mode = 'on'
restore_command = 'pgbackrest --log-level-console=info --process-max=6 --stanza=anl_master_prod archive-get %f "%p"'
After a properly executed planned shutdown of the primary instance, recovery on the WAL replica resumes smoothly.
And, of course, after restoring the database from a backup taken after the primary instance issue, the WAL replica works perfectly, as usual.
Apart from the obvious need to reconfigure monitoring to warn about WAL free space sooner, so as to avoid such cases, my expectation was that the WAL replica would be able to resume recovery, but it was not.
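For the monitoring part, something as crude as the following free-space check should have caught it early enough (the threshold and mount point are assumptions):
# warn once the pg_wal file system crosses 80% used
USED=$(df -P /pgsql/pg_wal | awk 'NR==2 { sub("%", "", $5); print $5 }')
[ "$USED" -ge 80 ] && echo "WARNING: pg_wal file system is ${USED}% used"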
We have stored copies of the 0000000100000FED0000003[89] WAL segments for analysis; is there anything that could help us find the root cause and mitigate this issue?
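In case it is useful, we can inspect the saved segments with pg_waldump around the failing LSN; a sketch, assuming the saved segments sit in a dedicated directory:
# dump WAL records starting shortly before the failure point FED/38FFE208
# (FED/38FFC540 is the "redo done" LSN from the primary log)
pg_waldump -p /path/to/saved_wals -t 1 -s FED/38FFC540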
I've found BUG #15412, which seems to be the same issue, but it remains open.
Thanks for any hints on how to recover from this issue or how to analyze it more deeply to find the root cause.
Kind regards,
Ales Zeleny