We are running PITR tests on PostgreSQL 9.5.4 (moving up from 9.1.18) and finding that postgres seems to archive an invalid WAL segment (full of nulls) when a segment has been preallocated at shutdown. We don't see the issue when no segment is preallocated during the restartpoint, and we have been using the same process on 9.1.18 without ever hitting it.
I haven't found any mention of this in the mailing lists. Has anyone seen this behavior? Is it safe to remove archived WAL segments that are all nulls when they overlap the same segment from a previous timeline?
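By "null" I mean the archived segment file contains nothing but zero bytes. Before removing anything we confirm that with a rough scan of the archive; this is only a sketch, and /path/to/archive is a placeholder for wherever our archive_command copies segments:

# flag archived files that contain nothing but zero bytes
# (/path/to/archive is a placeholder for the real archive location)
for f in /path/to/archive/0*; do
    [ -f "$f" ] && [ "$(tr -d '\0' < "$f" | wc -c)" -eq 0 ] && echo "$f is all nulls"
done

The 00000009000000000000007A segment discussed below is exactly such a file; pg_xlogdump finds no valid record in it.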
The system is recovered: WAL replay completes and postgres is shut down. As part of the shutdown, a restartpoint is created and a new WAL segment is preallocated.
2016-12-21 09:10:27.458 EST 12079 LOG: database system was interrupted; last known up at 2016-12-20 06:27:13 EST
2016-12-21 09:10:27.658 EST 12079 LOG: restored log file "00000008.history" from archive
2016-12-21 09:10:27.833 EST 12079 LOG: entering standby mode
2016-12-21 09:10:27.999 EST 12079 LOG: restored log file "00000008.history" from archive
2016-12-21 09:10:28.393 EST [local] postgres 12113 FATAL: the database system is starting up
2016-12-21 09:10:28.405 EST 12079 LOG: restored log file "000000070000000000000076" from archive
2016-12-21 09:10:28.577 EST 12079 LOG: restored log file "00000007.history" from archive
2016-12-21 09:10:28.580 EST 12079 LOG: redo starts at 0/760000D0
2016-12-21 09:10:28.582 EST 12079 LOG: consistent recovery state reached at 0/7600AC70
2016-12-21 09:10:28.583 EST 12077 LOG: database system is ready to accept read only connections
2016-12-21 09:10:28.977 EST 12079 LOG: restored log file "000000070000000000000077" from archive
2016-12-21 09:10:29.413 EST 12079 LOG: restored log file "000000070000000000000078" from archive
2016-12-21 09:10:29.819 EST 12079 LOG: restored log file "000000070000000000000079" from archive
2016-12-21 09:10:30.059 EST 12079 LOG: restored log file "00000008000000000000007A" from archive
2016-12-21 09:10:33.296 EST 12175 FATAL: could not connect to the primary server: could not connect to server: No route to host
Is the server running on host "172.10.10.200" and accepting
TCP/IP connections on port 5434?
2016-12-21 09:10:37.659 EST 12077 LOG: received fast shutdown request
2016-12-21 09:10:37.660 EST 12077 LOG: aborting any active transactions
2016-12-21 09:10:37.661 EST 12120 LOG: shutting down
2016-12-21 09:10:37.662 EST 12120 LOG: restartpoint starting: shutdown immediate
2016-12-21 09:10:37.705 EST 12120 LOG: restartpoint complete: wrote 65 buffers (0.0%); 1 transaction log file(s) added, 0 removed, 0 recycled; write=0.010 s, sync=0.003 s, total=0.043 s; sync files=42, longest=0.000 s, average=0.000 s; distance=65535 kB, estimate=65535 kB
2016-12-21 09:10:37.705 EST 12120 LOG: recovery restart point at 0/7A0000C8
2016-12-21 09:10:37.705 EST 12120 DETAIL: last completed transaction was at log time 2016-12-20 07:21:56.032147-05
2016-12-21 09:10:37.705 EST 12120 LOG: database system is shut down
Postgres preallocated 00000008000000000000007B (inode 38797329) as part of the shutdown:
38797329 -rw------- 1 postgres postgres 16777216 2016-12-21 09:10:37.696096222 -0500 00000008000000000000007B
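Dumping the start of the preallocated file confirms there is no WAL page header in it yet, only zero fill, which matches the "invalid magic number 0000 ... offset 0" message later. A sketch, run from this cluster's pg_xlog directory:

# show the first 32 bytes of the preallocated segment; a segment that has had
# WAL written to it begins with a non-zero page-header magic, this one is all 00
od -A d -t x1 -N 32 00000008000000000000007B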
Postgres is started and promoted.
2016-12-21 09:10:49.024 EST 12282 LOG: database system was shut down in recovery at 2016-12-21 09:10:37 EST
2016-12-21 09:10:49.210 EST 12282 LOG: entering standby mode
2016-12-21 09:10:49.400 EST 12282 LOG: restored log file "00000008.history" from archive
2016-12-21 09:10:49.639 EST 12282 LOG: restored log file "00000008000000000000007A" from archive
2016-12-21 09:10:49.654 EST 12282 LOG: redo starts at 0/7A0000C8
2016-12-21 09:10:49.654 EST 12282 LOG: consistent recovery state reached at 0/7B000000
2016-12-21 09:10:49.655 EST 12280 LOG: database system is ready to accept read only connections
2016-12-21 09:10:49.820 EST 12282 LOG: invalid magic number 0000 in log segment 00000008000000000000007B, offset 0
2016-12-21 09:10:52.880 EST 12313 FATAL: could not connect to the primary server: could not connect to server: No route to host
Is the server running on host "172.10.10.200" and accepting
TCP/IP connections on port 5434?
2016-12-21 09:10:55.548 EST 12282 LOG: received promote request
2016-12-21 09:10:55.548 EST 12350 FATAL: terminating walreceiver process due to administrator command
2016-12-21 09:10:55.905 EST 12282 LOG: redo done at 0/7A0001A8
2016-12-21 09:10:56.141 EST 12282 LOG: restored log file "00000008000000000000007A" from archive
2016-12-21 09:10:56.348 EST 12282 LOG: selected new timeline ID: 9
2016-12-21 09:10:56.515 EST 12282 LOG: restored log file "00000008.history" from archive
2016-12-21 09:10:56.546 EST 12282 LOG: archive recovery complete
2016-12-21 09:10:56.547 EST 12282 LOG: MultiXact member wraparound protections are now enabled
2016-12-21 09:10:56.547 EST 12304 LOG: checkpoint starting: force
00000008000000000000007B appears to get recycled to 00000009000000000000007A and archived after the promote:
38797329 -rw------- 1 postgres postgres 16777216 2016-12-21 09:10:37.696096222 -0500 00000009000000000000007A
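The inode and mtime are those of the segment preallocated at shutdown, and the archived copy is still nothing but zero bytes. A quick way to verify that (a sketch, assuming GNU cmp, the default 16 MB segment size, and /path/to/archive as a placeholder archive location):

# compare one segment's worth of the archived file against /dev/zero;
# a silent success (exit status 0) means the file is entirely null bytes
cmp -n 16777216 /dev/zero /path/to/archive/00000009000000000000007A && echo "all zeros"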
We then performed a PITR with recovery_target_timeline = 'latest'. Recovery retrieves 00000009000000000000007A but fails because the history file and the WAL don't line up. Removing 00000009000000000000007A from the archive corrects the issue and the PITR succeeds.
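For reference, a minimal recovery.conf for this kind of PITR would look like the following; this is a sketch rather than our exact file, and the cp-based restore_command path is a placeholder:

# recovery.conf (sketch); /path/to/archive is a placeholder
restore_command = 'cp /path/to/archive/%f %p'
recovery_target_timeline = 'latest'

With 'latest', recovery follows the newest history file, and as the log below shows it restores the timeline 9 copy of segment 0/7A and fails on it: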
2016-12-21 09:19:42.608 EST 13030 LOG: restored log file "00000009000000000000007A" from archive
2016-12-21 09:19:42.616 EST 13030 LOG: invalid magic number 0000 in log segment 00000009000000000000007A, offset 0
2016-12-21 09:19:42.616 EST 13030 FATAL: according to history file, WAL location 0/7A000000 belongs to timeline 7, but previous recovered WAL file came from timeline 9
# pg_xlogdump 00000007000000000000007A
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 0/7A000028, prev 0/790000D0, desc: CHECKPOINT_SHUTDOWN redo 0/7A000028; tli 7; prev tli 7; fpw true; xid 0/450172; oid 37074; multi 1; offset 0; oldest xid 1869 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
pg_xlogdump: FATAL: error in WAL record at 0/7A000028: invalid record length at 0/7A000098
# pg_xlogdump 00000008000000000000007A
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 0/7A000028, prev 0/790000D0, desc: CHECKPOINT_SHUTDOWN redo 0/7A000028; tli 7; prev tli 7; fpw true; xid 0/450172; oid 37074; multi 1; offset 0; oldest xid 1869 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
rmgr: XLOG len (rec/tot): 16/ 42, tx: 0, lsn: 0/7A000098, prev 0/7A000028, desc: END_OF_RECOVERY tli 8; prev tli 7; time 2016-12-20 08:39:29.793253 EST
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn: 0/7A0000C8, prev 0/7A000098, desc: RUNNING_XACTS nextXid 450172 latestCompletedXid 450171 oldestRunningXid 450172
rmgr: XLOG len (rec/tot): 80/ 106, tx: 0, lsn: 0/7A000100, prev 0/7A0000C8, desc: CHECKPOINT_ONLINE redo 0/7A0000C8; tli 8; prev tli 8; fpw true; xid 0/450172; oid 37074; multi 1; offset 0; oldest xid 1869 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 450172; online
rmgr: Standby len (rec/tot): 24/ 50, tx: 0, lsn: 0/7A000170, prev 0/7A000100, desc: RUNNING_XACTS nextXid 450172 latestCompletedXid 450171 oldestRunningXid 450172
rmgr: XLOG len (rec/tot): 0/ 24, tx: 0, lsn: 0/7A0001A8, prev 0/7A000170, desc: SWITCH
# pg_xlogdump 00000009000000000000007A
pg_xlogdump: FATAL: could not find a valid record after 0/7A000000
# cat 00000009.history
1 0/3000098 no recovery target specified
2 0/67000098 no recovery target specified
3 0/6B000098 no recovery target specified
4 0/6E000000 no recovery target specified
5 0/70000098 no recovery target specified
6 0/74000000 no recovery target specified
7 0/7A000098 no recovery target specified
8 0/7B000000 no recovery target specified
-Marty Kulma