Re: Replication failure, slave requesting old segments

Adrian Klaver <adrian.klaver@xxxxxxxxxxx> · Sat, 11 Aug 2018 14:48:01 -0700

On 08/11/2018 12:42 PM, Phil Endecott wrote:
> Hi Adrian,
>
> Adrian Klaver wrote:
>> Looks like the master recycled the WALs while the slave could not
>> connect.
>
> Yes but... why is that a problem?  The master is copying the WALs to
> the backup server using scp, where they remain forever.  The slave gets

To me it looks like that did not happen:

2018-08-11 00:05:50.364 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG:  started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000007000000D0 has already been removed

Above, 0000000100000007000000D0 is gone (recycled) on the master, and the
archived version does not seem to be complete, as streaming replication
is still trying to fetch it.
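For anyone following along, the LSN in the "started streaming WAL" line and the segment file name are two views of the same position. A minimal Python sketch of the mapping, assuming the default 16 MB WAL segment size and timeline 1 as shown in the log; the helper names are made up for illustration:

```python
# Minimal sketch: map an LSN such as "7/D0000000" to the name of the WAL
# segment file that contains it. Assumes the default 16 MB segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024                    # assumed default
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE  # 256 with 16 MB segments


def parse_lsn(text: str) -> int:
    """Convert the 'hi/lo' hex notation used in the logs to one 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def segment_name(timeline: int, lsn: int) -> str:
    """Return the 24-hex-digit file name of the segment holding this LSN."""
    segno = lsn // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


# From the log: "started streaming WAL from primary at 7/D0000000 on timeline 1"
print(segment_name(1, parse_lsn("7/D0000000")))  # 0000000100000007000000D0
```

With 16 MB segments the last two hex digits of the file name are simply the first two digits after the slash, which is why 7/D0000000 lines up with *D0.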

Below, you kick the master and it coughs up the files to the archive,
including *D0 and *D1 on up through *D4, and then streaming picks up at *D5.

2018-08-11 00:55:49.741 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG:  restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG:  restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG:  restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG:  restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG:  unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG:  started streaming WAL from primary at 7/D5000000 on timeline 1
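The "unexpected pageaddr" line in that excerpt can be read with a bit of arithmetic. Assuming 16 MB segments, the page at offset 0 of segment 0000000100000007000000D5 should carry address 7/D5000000, whereas 7/C7000000 belongs to segment *C7, 14 segments earlier. That is consistent with a recycled segment file (PostgreSQL renames old segments for reuse) that had not been overwritten yet, after which the standby switched to streaming at 7/D5000000. A small sketch of the check; the function names are hypothetical:

```python
# Minimal sketch: does a WAL page address match the segment file it was read
# from? Assumes the default 16 MB segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024                    # assumed default
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE  # 256 with 16 MB segments


def parse_lsn(text: str) -> int:
    """Convert the 'hi/lo' hex notation used in the logs to one 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def segno_from_name(name: str) -> int:
    """Segment number encoded in the last 16 hex digits of a segment file name."""
    return int(name[8:16], 16) * SEGMENTS_PER_XLOGID + int(name[16:24], 16)


def segments_behind(name: str, offset: int, pageaddr: str) -> int:
    """How many segments older the page is than its file name implies (0 = OK)."""
    expected = segno_from_name(name) * WAL_SEGMENT_SIZE + offset
    found = parse_lsn(pageaddr)
    return (expected - found) // WAL_SEGMENT_SIZE


# From the log: "unexpected pageaddr 7/C7000000 in log segment ...D5, offset 0"
print(segments_behind("0000000100000007000000D5", 0, "7/C7000000"))  # 14
```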

Best guess is the archiving did not work as expected during:

"(During this time the master was also down for a shorter period.)"

> them from there before it starts streaming.  So it shouldn't matter
> if the master recycles them, as the slave should be able to get everything
> using the combination of scp and then streaming.
>
> Am I missing something about how this sort of replication is supposed to
> work?
>
> Thanks, Phil.
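Phil's expectation above can be written as a deliberately simplified Python model. The function name and segment numbers are hypothetical; it treats segments as whole units and ignores details such as the standby asking the primary again for a segment it had already restored (the first log excerpt shows it doing exactly that with *D0). Under those assumptions, archive plus streaming is only gap-free if archiving keeps up whenever the primary might recycle WAL:

```python
from typing import Set


def can_catch_up(next_needed: int, archived: Set[int], primary_oldest: int) -> bool:
    """Simplified model of 'restore from the archive, then stream'.

    next_needed:    first segment number the standby still needs
    archived:       segment numbers present in the archive (restore_command)
    primary_oldest: oldest segment number the primary still keeps
    """
    seg = next_needed
    # Restore consecutive segments from the archive until the first gap.
    while seg in archived:
        seg += 1
    # At the gap, fall back to streaming, which only works if the primary
    # has not yet removed or recycled that segment.
    return seg >= primary_oldest


# Hypothetical numbers shaped like the thread, not data from Phil's servers.
print(can_catch_up(0xD0, {0xD0}, 0xD3))                          # False
print(can_catch_up(0xD0, {0xD0, 0xD1, 0xD2, 0xD3, 0xD4}, 0xD5))  # True
```

In the first call *D1 is neither archived nor still on the primary, so the model gets stuck; in the second the archive covers everything up to where the primary's retained WAL begins.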

--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx