On 08/12/2018 12:25 PM, Phil Endecott wrote:
Hi Adrian,
Adrian Klaver wrote:
On 08/11/2018 12:42 PM, Phil Endecott wrote:
Hi Adrian,
Adrian Klaver wrote:
Looks like the master recycled the WALs while the slave could not
connect.
Yes but... why is that a problem? The master is copying the WALs to
the backup server using scp, where they remain forever. The slave gets
To me it looks like that did not happen:
2018-08-11 00:05:50.364 UTC [615] LOG: restored log file
"0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such
file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from
primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from
WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has
already been removed
Above, 0000000100000007000000D0 is gone/recycled on the master, and the
archived version does not seem to be complete, as streaming
replication is trying to fetch it.
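
As an aside, the segment name and the streaming start position in the log above refer to the same place in the WAL. A minimal Python sketch of that mapping follows, assuming the default 16 MB segment size; the function name parse_wal_segment_name is made up for illustration and is not part of PostgreSQL.

```python
# Sketch only: assumes the default 16 MB WAL segment size.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_wal_segment_name(name, segment_size=WAL_SEGMENT_SIZE):
    """Return (timeline, start LSN as 'X/X') for a 24-hex-digit WAL file name.

    The name is three 8-digit hex fields: timeline, log id, and segment
    number within that log id.
    """
    if len(name) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg_no = int(name[16:24], 16)
    # Byte offset of the segment start within its log id.
    offset = seg_no * segment_size
    return timeline, "{:X}/{:X}".format(log_id, offset)


print(parse_wal_segment_name("0000000100000007000000D0"))
```

For 0000000100000007000000D0 this gives timeline 1 and 7/D0000000, which is the position in the "started streaming WAL from primary at 7/D0000000 on timeline 1" message.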
The files on the backup server were all 16 MB.
WAL files are created/recycled as 16 MB files, which is not the same as
saying they are complete for the purposes of restoring. In other words
you could be looking at a 16 MB file full of 0's.
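
A rough way to rule out that extreme case is to scan an archived file for any non-zero byte. The Python sketch below does only that; the function name is_all_zero is hypothetical, and a file that passes is merely "not blank", which is not evidence that it holds every record the slave needs.

```python
def is_all_zero(path, chunk_size=1024 * 1024):
    """Return True if the file contains no non-zero byte.

    An empty file also returns True. This is a coarse check only: it does
    not validate WAL contents in any way.
    """
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # Reached end of file without seeing a non-zero byte.
                return True
            if chunk.count(0) != len(chunk):
                return False
```

It could be run over a copy of each archived segment, for example is_all_zero("0000000100000007000000D0") in a directory holding the copies.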
Below, you kick the master and it coughs up the files to the archive,
including *D0 and *D1 on up to *D4, and then streaming picks up at
*D5.
When I kicked it, the master wrote D1 to D4 to the backup. It did not
change D0 (its modification time on the backup is from before the "kick").
The slave re-read D0, again, as it had been doing throughout this period,
and then read D1 to D4.
Well, something happened, because the slave could not get all the
information it needed from the D0 in the archive and was trying to get
it from the master's pg_xlog.
Best guess is the archiving did not work as expected during:
"(During this time the master was also down for a shorter period.)"
Around the time the master was down, the WAL segment names were CB and CC.
Files CD to CF were written between the master coming up and the slave
coming up. The slave had no trouble restoring those segments when it
started.
The problematic segments D0 and D1 were the ones that were "current"
when the slave restarted, at which time the master was up consistently.
Regards, Phil.
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx