Replication failure, slave requesting old segments

"Phil Endecott" <spam_from_pgsql_lists@xxxxxxxxxxxx> · Sat, 11 Aug 2018 16:18:55 +0100

Dear Experts,

I recently set up replication for the first time.  It seemed to be 
working OK in my initial tests, but didn't cope when the slave was 
down for a longer period.  This is all with the Debian stable 
packages of PostgreSQL 9.6.  My replication setup involves a third 
server, "backup", to and from which WAL files are copied using scp 
(and essentially never deleted), plus streaming replication in 
"hot standby" mode.

On the master, I have:

wal_level = replica
archive_mode = on
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
                   scp %p backup:backup/postgresql/archivedir/%f'

On the slave I have:

standby_mode = 'on'
primary_conninfo = 'user=postgres host=master port=5432'
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'

hot_standby = on

This all seemed to work OK until the slave was shut down for a few days.  
(During this time the master was also down for a shorter period.) 
When the slave came back up it successfully copied several WAL files 
from the backup server until it reached one that hasn't been created yet:

2018-08-10 22:37:38.322 UTC [615] LOG:  restored log file "0000000100000007000000CC" from archive
2018-08-10 22:38:02.011 UTC [615] LOG:  restored log file "0000000100000007000000CD" from archive
2018-08-10 22:38:29.802 UTC [615] LOG:  restored log file "0000000100000007000000CE" from archive
2018-08-10 22:38:55.973 UTC [615] LOG:  restored log file "0000000100000007000000CF" from archive
2018-08-10 22:39:24.964 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory

At this point there was a temporary problem due to a firewall rule that 
blocked the replication connection to the master:

2018-08-10 22:41:35.865 UTC [1289] FATAL:  could not connect to the primary server: SSL connection has been closed unexpectedly

The slave then entered a loop, retrying to scp the next file from the backup 
archive (and failing) and connecting to the master for streaming replication 
(and also failing).  That is as expected, except for the "invalid magic number":

scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:41:37.018 UTC [615] LOG:  invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-10 22:43:46.936 UTC [1445] FATAL:  could not connect to the primary server: SSL connection has been closed unexpectedly

Note that it's complaining about an invalid magic number in log segment D1, which is 
the one that has just failed to scp.  That looked suspicious to me and I remembered 
notes in the docs about ensuring that the archive and/or restore commands return an 
error exit status when they fail; I checked that and yes scp does exit(1) when the 
requested file doesn't exist:

$ scp backup:/egheriugherg /tmp/a
scp: /egheriugherg: No such file or directory
$ echo $?
1

Anyway, the slave continued in its retrying loop as expected, except that each time 
it re-fetched the previous segment, D0; this seems a bit peculiar:

2018-08-10 22:44:17.796 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:46:28.727 UTC [1628] FATAL:  could not connect to the primary server: SSL connection has been closed unexpectedly

(Note the "invalid magic number" was only reported the first time.)

Eventually I fixed the networking issue that had prevented the replication connection 
to master and it could start streaming:

2018-08-11 00:05:50.364 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG:  started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000007000000D0 has already been removed

So here is the real problem which worries me: it seems to ask the master for 
segment D0, which is one that it did successfully restore from the backup earlier 
on, and the master - not unexpectedly - reports that it has already been removed.

I tried restarting the slave:

2018-08-11 00:12:15.536 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG:  redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG:  consistent recovery state reached at 7/D0FFF088
2018-08-11 00:12:16.068 UTC [7953] LOG:  database system is ready to accept read only connections
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:12:17.057 UTC [7954] LOG:  invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-11 00:12:17.242 UTC [8034] LOG:  started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:12:17.242 UTC [8034] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000007000000D0 has already been removed

Note that again it failed to get D1 from the backup, as expected, and then 
reports "invalid magic number" in that file, and again tries to start 
streaming from a segment that the master doesn't have.

Eventually I decided to try and cause the D1 file to exist in the backup:

db=> create table waste_space as select * from large_table;

That caused four new WAL segments to be copied to the backup, and restored 
by the slave:

2018-08-11 00:55:49.741 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG:  restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG:  restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG:  restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG:  restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG:  unexpected pageaddr 7/C7000000 in log segment 00000001000000070000
00D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG:  started streaming WAL from primary at 7/D5000000 on timeline 1

Note the "unexpected pageaddr" message.  Despite that, the streaming started 
and everything was working correctly.  Hurray!

I remain concerned, however, that something bad happened and I'd like to 
understand it.  Can anyone help?

Thanks,

Phil.