We are running PITR tests on PostgreSQL 9.5.4 (moving up from 9.1.18) and finding that postgres seems to archive an invalid WAL segment (full of nulls) when a segment has been preallocated at shutdown. We don't see the issue when no segment is preallocated during the restartpoint, and we have been using the same process on 9.1.18 without ever hitting it.
I haven't found any mention of this in the mailing lists. Has anyone seen this behavior? Is it safe to remove archived WAL segments that are all nulls when they overlap the same segment from a previous timeline?
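By "null" I mean the archived segment file contains nothing but zero bytes. Before removing anything we confirm that with a rough scan of the archive; this is only a sketch, and /path/to/archive is a placeholder for wherever our archive_command copies segments:

# flag archived files that contain nothing but zero bytes
# (/path/to/archive is a placeholder for the real archive location)
for f in /path/to/archive/0*; do
    [ -f "$f" ] && [ "$(tr -d '\0' < "$f" | wc -c)" -eq 0 ] && echo "$f is all nulls"
done

The 00000009000000000000007A segment discussed below is exactly such a file; pg_xlogdump finds no valid record in it.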
The system is recovered: WAL replay completes and postgres is shut down. As part of the shutdown, a restartpoint is created and a new WAL segment is preallocated.
2016-12-21 09:10:27.458 EST 12079 LOG: database system was interrupted; last known up at 2016-12-20 06:27:13 EST
2016-12-21 09:10:27.658 EST 12079 LOG: restored log file "00000008.history" from archive
2016-12-21 09:10:27.833 EST 12079 LOG: entering standby mode
2016-12-21 09:10:27.999 EST 12079 LOG: restored log file "00000008.history" from archive
2016-12-21 09:10:28.393 EST [local] postgres 12113 FATAL: the database system is starting up
2016-12-21 09:10:28.405 EST 12079 LOG: restored log file "000000070000000000000076" from archive
2016-12-21 09:10:28.577 EST 12079 LOG: restored log file "00000007.history" from archive
2016-12-21 09:10:28.580 EST 12079 LOG: redo starts at 0/760000D0
2016-12-21 09:10:28.582 EST 12079 LOG: consistent recovery state reached at 0/7600AC70
2016-12-21 09:10:28.583 EST 12077 LOG: database system is ready to accept read only connections
2016-12-21 09:10:28.977 EST 12079 LOG: restored log file "000000070000000000000077" from archive
2016-12-21 09:10:29.413 EST 12079 LOG: restored log file "000000070000000000000078" from archive
2016-12-21 09:10:29.819 EST 12079 LOG: restored log file "000000070000000000000079" from archive
2016-12-21 09:10:30.059 EST 12079 LOG: restored log file "00000008000000000000007A" from archive
2016-12-21 09:10:33.296 EST 12175 FATAL: could not connect to the primary server: could not connect to server: No route to host
Is the server running on host "172.10.10.200" and accepting
TCP/IP connections on port 5434?
2016-12-21 09:10:37.659 EST 12077 LOG: received fast shutdown request
2016-12-21 09:10:37.660 EST 12077 LOG: aborting any active transactions
2016-12-21 09:10:37.661 EST 12120 LOG: shutting down
2016-12-21 09:10:37.662 EST 12120 LOG: restartpoint starting: shutdown immediate
2016-12-21 09:10:37.705 EST 12120 LOG: restartpoint complete: wrote 65 buffers (0.0%); 1 transaction log file(s) added, 0 removed, 0 recycled; write=0.010 s, sync=0.003 s, total=0.043 s; sync files=42, longest=0.000 s, average=0.000 s; distance=65535 kB, estimate=65535 kB
2016-12-21 09:10:37.705 EST 12120 LOG: recovery restart point at 0/7A0000C8
2016-12-21 09:10:37.705 EST 12120 DETAIL: last completed transaction was at log time 2016-12-20 07:21:56.032147-05
2016-12-21 09:10:37.705 EST 12120 LOG: database system is shut down
Postgres preallocated 00000008000000000000007B (inode 38797329) as part of the shutdown:
38797329 -rw------- 1 postgres postgres 16777216 2016-12-21 09:10:37.696096222 -0500 00000008000000000000007B
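Dumping the start of the preallocated file confirms there is no WAL page header in it yet, only zero fill, which matches the "invalid magic number 0000 ... offset 0" message later. A sketch, run from this cluster's pg_xlog directory:

# show the first 32 bytes of the preallocated segment; a segment that has had
# WAL written to it begins with a non-zero page-header magic, this one is all 00
od -A d -t x1 -N 32 00000008000000000000007B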
Postgres is started and promoted.
2016-12-21 09:10:49.024 EST 12282 LOG: database system was shut down in recovery at 2016-12-21 09:10:37 EST
2016-12-21 09:10:49.210 EST 12282 LOG: entering standby mode
2016-12-21 09:10:49.400 EST 12282 LOG: restored log file "00000008.history" from archive
2016-12-21 09:10:49.639 EST 12282 LOG: restored log file "00000008000000000000007A" from archive
2016-12-21 09:10:49.654 EST 12282 LOG: redo starts at 0/7A0000C8
2016-12-21 09:10:49.654 EST 12282 LOG: consistent recovery state reached at 0/7B000000
2016-12-21 09:10:49.655 EST 12280 LOG: database system is ready to accept read only connections
2016-12-21 09:10:49.820 EST 12282 LOG: invalid magic number 0000 in log segment 00000008000000000000007B, offset 0
2016-12-21 09:10:52.880 EST 12313 FATAL: could not connect to the primary server: could not connect to server: No route to host
Is the server running on host "172.10.10.200" and accepting
TCP/IP connections on port 5434?
2016-12-21 09:10:55.548 EST 12282 LOG: received promote request
2016-12-21 09:10:55.548 EST 12350 FATAL: terminating walreceiver process due to administrator command
2016-12-21 09:10:55.905 EST 12282 LOG: redo done at 0/7A0001A8
2016-12-21 09:10:56.141 EST 12282 LOG: restored log file "00000008000000000000007A" from archive
2016-12-21 09:10:56.348 EST 12282 LOG: selected new timeline ID: 9
2016-12-21 09:10:56.515 EST 12282 LOG: restored log file "00000008.history" from archive
2016-12-21 09:10:56.546 EST 12282 LOG: archive recovery complete
2016-12-21 09:10:56.547 EST 12282 LOG: MultiXact member wraparound protections are now enabled
2016-12-21 09:10:56.547 EST 12304 LOG: checkpoint starting: force
00000008000000000000007B appears to get recycled to 00000009000000000000007A and archived after the promote:
38797329 -rw------- 1 postgres postgres 16777216 2016-12-21 09:10:37.696096222 -0500 00000009000000000000007A
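The inode and mtime are those of the segment preallocated at shutdown, and the archived copy is still nothing but zero bytes. A quick way to verify that (a sketch, assuming GNU cmp, the default 16 MB segment size, and /path/to/archive as a placeholder archive location):

# compare one segment's worth of the archived file against /dev/zero;
# a silent success (exit status 0) means the file is entirely null bytes
cmp -n 16777216 /dev/zero /path/to/archive/00000009000000000000007A && echo "all zeros"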
We then performed a PITR with recovery_target_timeline = 'latest'. Recovery retrieves 00000009000000000000007A but fails because the history file and the WAL don't line up. Removing 00000009000000000000007A from the archive corrects the issue and the PITR succeeds.
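For reference, a minimal recovery.conf for this kind of PITR would look like the following; this is a sketch rather than our exact file, and the cp-based restore_command path is a placeholder:

# recovery.conf (sketch); /path/to/archive is a placeholder
restore_command = 'cp /path/to/archive/%f %p'
recovery_target_timeline = 'latest'

With 'latest', recovery follows the newest history file, and as the log below shows it restores the timeline 9 copy of segment 0/7A and fails on it: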
2016-12-21 09:19:42.608 EST 13030 LOG: restored log file "00000009000000000000007A" from archive
2016-12-21 09:19:42.616 EST 13030 LOG: invalid magic number 0000 in log segment 00000009000000000000007A, offset 0
2016-12-21 09:19:42.616 EST 13030 FATAL: according to history file, WAL location 0/7A000000 belongs to timeline 7, but previous recovered WAL file came from timeline 9
# pg_xlogdump 00000007000000000000007A
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 0/7A000028, prev 0/790000D0, desc: CHECKPOINT_SHUTDOWN redo 0/7A000028; tli 7; prev tli 7; fpw true; xid 0/450172; oid 37074; multi 1; offset 0; oldest xid 1869 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
pg_xlogdump: FATAL: error in WAL record at 0/7A000028: invalid record length at 0/7A000098
# pg_xlogdump 00000008000000000000007A
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 0/7A000028, prev 0/790000D0, desc: CHECKPOINT_SHUTDOWN redo 0/7A000028; tli 7; prev tli 7; fpw true; xid 0/450172; oid 37074; multi 1; offset 0; oldest xid 1869 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: XLOG len (rec/tot): 16/ 42, tx: 0, lsn: 0/7A000098, prev 0/7A000028, desc: END_OF_RECOVERY tli 8; prev tli 7; time 2016-12-20 08:39:29.793253 EST
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn: 0/7A0000C8, prev 0/7A000098, desc: RUNNING_XACTS nextXid 450172 latestCompletedXid 450171 oldestRunningXid 450172
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 0/7A000100, prev 0/7A0000C8, desc: CHECKPOINT_ONLINE redo 0/7A0000C8; tli 8; prev tli 8; fpw true; xid 0/450172; oid 37074; multi 1; offset 0; oldest xid 1869 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 450172; online
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn: 0/7A000170, prev 0/7A000100, desc: RUNNING_XACTS nextXid 450172 latestCompletedXid 450171 oldestRunningXid 450172
rmgr: XLOG len (rec/tot): 0/ 24, tx: 0, lsn: 0/7A0001A8, prev 0/7A000170, desc: SWITCH
# pg_xlogdump 00000009000000000000007A
pg_xlogdump: FATAL: could not find a valid record after 0/7A000000
# cat 00000009.history
1 0/3000098 no recovery target specified
2 0/67000098 no recovery target specified
3 0/6B000098 no recovery target specified
4 0/6E000000 no recovery target specified
5 0/70000098 no recovery target specified
6 0/74000000 no recovery target specified
7 0/7A000098 no recovery target specified
8 0/7B000000 no recovery target specified
-Marty Kulma