Search Postgresql Archives

Re: Replication failure, slave requesting old segments

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 08/11/2018 08:18 AM, Phil Endecott wrote:
Dear Experts,

I recently set up replication for the first time.  It seemed to be
working OK in my initial tests, but didn't cope when the slave was
down for a longer period.  This is all with the Debian stable
packages of PostgreSQL 9.6.  My replication setup involves a third
server, "backup", to and from which WAL files are copied using scp
(and essentially never deleted), plus streaming replication in
"hot standby" mode.

On the master, I have:

wal_level = replica
archive_mode = on
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
                    scp %p backup:backup/postgresql/archivedir/%f'

On the slave I have:

standby_mode = 'on'
primary_conninfo = 'user=postgres host=master port=5432'
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'

hot_standby = on


This all seemed to work OK until the slave was shut down for a few days.
(During this time the master was also down for a shorter period.)
When the slave came back up it successfully copied several WAL files
from the backup server until it reached one that hasn't been created yet:

2018-08-10 22:37:38.322 UTC [615] LOG:  restored log file "0000000100000007000000CC" from archive
2018-08-10 22:38:02.011 UTC [615] LOG:  restored log file "0000000100000007000000CD" from archive
2018-08-10 22:38:29.802 UTC [615] LOG:  restored log file "0000000100000007000000CE" from archive
2018-08-10 22:38:55.973 UTC [615] LOG:  restored log file "0000000100000007000000CF" from archive
2018-08-10 22:39:24.964 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory

At this point there was a temporary problem due to a firewall rule that
blocked the replication connection to the master:

2018-08-10 22:41:35.865 UTC [1289] FATAL:  could not connect to the primary server: SSL connection has been closed unexpectedly

The slave then entered a loop, retrying to scp the next file from the backup
archive (and failing) and connecting to the master for streaming replication
(and also failing).  That is as expected, except for the "invalid magic number":

scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:41:37.018 UTC [615] LOG:  invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-10 22:43:46.936 UTC [1445] FATAL:  could not connect to the primary server: SSL connection has been closed unexpectedly

Note that it's complaining about an invalid magic number in log segment D1, which is
the one that has just failed to scp.  That looked suspicious to me and I remembered
notes in the docs about ensuring that the archive and/or restore commands return an
error exit status when they fail; I checked that and yes scp does exit(1) when the
requested file doesn't exist:

$ scp backup:/egheriugherg /tmp/a
scp: /egheriugherg: No such file or directory
$ echo $?
1

Anyway, the slave continued in its retrying loop as expected, except that each time
it re-fetched the previous segment, D0; this seems a bit peculiar:

2018-08-10 22:44:17.796 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:46:28.727 UTC [1628] FATAL:  could not connect to the primary server: SSL connection has been closed unexpectedly

(Note the "invalid magic number" was only reported the first time.)

Eventually I fixed the networking issue that had prevented the replication connection
to master and it could start streaming:

2018-08-11 00:05:50.364 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG:  started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000007000000D0 has already been removed

So here is the real problem which worries me: it seems to ask the master for
segment D0, which is one that it did successfully restore from the backup earlier
on, and the master - not unexpectedly - reports that it has already been removed.

I tried restarting the slave:

2018-08-11 00:12:15.536 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG:  redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG:  consistent recovery state reached at 7/D0FFF088
2018-08-11 00:12:16.068 UTC [7953] LOG:  database system is ready to accept read only connections
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:12:17.057 UTC [7954] LOG:  invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-11 00:12:17.242 UTC [8034] LOG:  started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:12:17.242 UTC [8034] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000007000000D0 has already been removed

Note that again it failed to get D1 from the backup, as expected, and then
reports "invalid magic number" in that file, and again tries to start
streaming from a segment that the master doesn't have.

Eventually I decided to try and cause the D1 file to exist in the backup:

db=> create table waste_space as select * from large_table;

That caused four new WAL segments to be copied to the backup, and restored
by the slave:

2018-08-11 00:55:49.741 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG:  restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG:  restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG:  restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG:  restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG:  unexpected pageaddr 7/C7000000 in log segment 00000001000000070000
00D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG:  started streaming WAL from primary at 7/D5000000 on timeline 1

Note the "unexpected pageaddr" message.  Despite that, the streaming started
and everything was working correctly.  Hurray!

I remain concerned, however, that something bad happened and I'd like to
understand it.  Can anyone help?

Looks like the master recycled the WAL's while the slave could not connect.

Might want to take a look at replication slots to keep WAL's from being recycled:

https://www.postgresql.org/docs/9.6/static/warm-standby.html#STREAMING-REPLICATION-SLOTS

with the heads up that slots do not limit the space taken up by WAL's.


The "unexpected pageaddr" probably is for same reason, per comments in xlogreader.c:

"
/*
 * Check that the address on the page agrees with what we expected.
 * This check typically fails when an old WAL segment is recycled,
 * and hasn't yet been overwritten with new data yet.
 */

...

report_invalid_record(state,
"unexpected pageaddr %X/%X in log segment %s, offset %u",
(uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
                                             fname,
                                             offset);
"



Thanks,

Phil.






--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux