Dear Experts,

I recently set up replication for the first time. It seemed to be working OK in my initial tests, but it didn't cope when the slave was down for a longer period. This is all with the Debian stable packages of PostgreSQL 9.6.

My replication setup involves a third server, "backup", to and from which WAL files are copied using scp (and essentially never deleted), plus streaming replication in "hot standby" mode.

On the master I have:

wal_level = replica
archive_mode = on
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f && scp %p backup:backup/postgresql/archivedir/%f'

On the slave I have:

standby_mode = 'on'
primary_conninfo = 'user=postgres host=master port=5432'
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
hot_standby = on

This all seemed to work OK until the slave was shut down for a few days. (During this time the master was also down for a shorter period.) When the slave came back up it successfully copied several WAL files from the backup server, until it reached one that hadn't yet been created:

2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory

At this point there was a temporary problem, due to a firewall rule that blocked the replication connection to the master:

2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

The slave then entered a loop, alternately retrying the scp of the next file from the backup archive (and failing) and trying to connect to the master for streaming replication (also failing). That is all as expected, except for the "invalid magic number":

scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Note that it's complaining about an invalid magic number in log segment D1, which is the one that had just failed to scp. That looked suspicious to me, and I remembered the notes in the docs about ensuring that the archive and restore commands return a non-zero exit status when they fail. I checked that, and yes, scp does exit(1) when the requested file doesn't exist:

$ scp backup:/egheriugherg /tmp/a
scp: /egheriugherg: No such file or directory
$ echo $?
1

Anyway, the slave continued in its retry loop as expected, except that each time it re-fetched the previous segment, D0, which seems a bit peculiar:

2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

(Note that the "invalid magic number" was only reported the first time.)
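Incidentally, while staring at this I noticed that my archive_command is not atomic: if an scp is interrupted part-way, it can leave a truncated file under the final name on the backup server, after which the "test ! -f" guard makes every further attempt to archive that segment fail. A more defensive wrapper would copy to a temporary name and only rename on success; something like the following untested sketch (the script name and install path are invented; the directories match my layout):

#!/bin/sh
# archive_wal.sh -- hypothetical wrapper, to be used as:
#   archive_command = '/usr/local/bin/archive_wal.sh %p %f'
set -e
LOCAL_PATH="$1"   # %p: path of the segment file to archive
FNAME="$2"        # %f: bare file name of the segment
DEST="backup/postgresql/archivedir"

# Refuse to overwrite an existing archive file, exiting non-zero.
ssh backup "test ! -f '$DEST/$FNAME'"

# Copy under a temporary name, then rename into place; mv within one
# filesystem is atomic, so a half-copied file never appears under the
# final name.
scp "$LOCAL_PATH" "backup:$DEST/$FNAME.tmp"
ssh backup "mv '$DEST/$FNAME.tmp' '$DEST/$FNAME'"

I mention it only because I wondered whether a partial file of that sort could be involved in what follows.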
Eventually I fixed the networking issue that had prevented the replication connection to the master, and it could start streaming:

2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed

So here is the real problem, which worries me: the slave asks the master for segment D0, which is one that it did successfully restore from the backup earlier on, and the master - not unexpectedly - reports that it has already been removed.

I tried restarting the slave:

2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed

Again it failed to get D1 from the backup, as expected; it then reported an "invalid magic number" in that file, and again tried to start streaming from a segment that the master no longer has.

Eventually I decided to try to cause the D1 file to exist in the backup:

db=> create table waste_space as select * from large_table;

That caused four new WAL segments to be copied to the backup and restored by the slave:

2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1

Note the "unexpected pageaddr" message. Despite that, streaming started and everything has been working correctly since. Hurray!

I remain concerned, however, that something bad happened and I'd like to understand it. Can anyone help?

Thanks, Phil.
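P.S. It has since occurred to me that, rather than creating a junk table to push the WAL forward, I could presumably have requested a segment switch directly (pg_switch_xlog() is the 9.6 spelling; I gather it became pg_switch_wal() in version 10):

db=> select pg_switch_xlog();

That should complete the current segment and hand it to the archive_command without generating any junk data.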
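P.P.S. In case it's relevant to the "has already been removed" error: as I understand it, the master only retains finished segments for streaming up to wal_keep_segments (0 by default), unless a physical replication slot obliges it to keep them. If the answer turns out to be "use a slot", I believe the 9.6 incantation would be something like the following (the slot name here is invented). On the master:

db=> select pg_create_physical_replication_slot('standby1');

and in the slave's recovery.conf:

primary_slot_name = 'standby1'

I haven't tried this yet, but mention it in case it's part of the story.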