Greetings,

* Phil Endecott (spam_from_pgsql_lists@xxxxxxxxxxxx) wrote:
> Dear Experts,

Since you're asking ...

> I recently set up replication for the first time.  It seemed to be
> working OK in my initial tests, but didn't cope when the slave was
> down for a longer period.  This is all with the Debian stable
> packages of PostgreSQL 9.6.  My replication setup involves a third
> server, "backup", to and from which WAL files are copied using scp
> (and essentially never deleted), plus streaming replication in
> "hot standby" mode.
>
> On the master, I have:
>
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>  scp %p backup:backup/postgresql/archivedir/%f'

This is really not a sufficient or particularly intelligent
archive_command.  In general, I'd strongly recommend against trying to
roll your own archiving/backup solution for PostgreSQL.  In particular,
the above is quite expensive, doesn't address the case where a file is
only partially copied to the backup server, and doesn't ensure that the
file is completely written and fsync'd on the backup server, meaning
that a failure will likely result in WAL going missing.

There are much better solutions out there which you really should be
using instead of trying to build your own.  My preference is pgbackrest
(though, to be clear, I'm one of the folks involved in its development),
but there are other options such as barman or WAL-E which I believe also
address the concerns raised above.
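Just to make that concern concrete, here is a minimal sketch of the sort
of wrapper a hand-rolled setup would need at the very least (the script
name and paths are made up for illustration, and even this leaves out
retries, checksums and WAL expiry, which I'll come back to below):

    #!/bin/sh
    # Hypothetical wrapper, e.g. saved as /usr/local/bin/archive_wal.sh and
    # configured as:  archive_command = '/usr/local/bin/archive_wal.sh %p %f'
    # (script name and paths are illustrative only)
    set -e
    WAL_PATH="$1"; WAL_FILE="$2"
    ARCHIVE=backup/postgresql/archivedir

    # Never overwrite a segment that already made it into the archive.
    ssh backup "test ! -f $ARCHIVE/$WAL_FILE" || exit 1

    # Copy under a temporary name so that a half-copied segment is never
    # visible under its final name to anything restoring from the archive.
    scp "$WAL_PATH" "backup:$ARCHIVE/$WAL_FILE.tmp"

    # Flush it to disk on the backup server, then rename it into place.
    # (a bare "sync" is blunt but portable)
    ssh backup "sync && mv $ARCHIVE/$WAL_FILE.tmp $ARCHIVE/$WAL_FILE"

Even that doesn't cover everything, which is why I keep pointing people
at the existing tools.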
> On the slave I have:
>
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
>
> hot_standby = on
>
> This all seemed to work OK until the slave was shut down for a few days.
> (During this time the master was also down for a shorter period.)
> When the slave came back up it successfully copied several WAL files
> from the backup server until it reached one that hasn't been created yet:

That all sounds fine.

> 2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
> 2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
> 2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
> 2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
> 2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>
> At this point there was a temporary problem due to a firewall rule that
> blocked the replication connection to the master:
>
> 2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Well, that's certainly not good.

> The slave then entered a loop, retrying to scp the next file from the backup
> archive (and failing) and connecting to the master for streaming replication
> (and also failing).

That is as expected, except for the "invalid magic number":

> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Note that the "invalid magic number" error is just LOG-level, and that
PostgreSQL will certainly ask for WAL files which don't exist in the
archive yet.

> Note that it's complaining about an invalid magic number in log segment D1, which is
> the one that has just failed to scp.  That looked suspicious to me and I remembered
> notes in the docs about ensuring that the archive and/or restore commands return an
> error exit status when they fail; I checked that and yes scp does exit(1) when the
> requested file doesn't exist:

Sure, but scp doesn't produce any output when it's successful, so are
you sure that an scp wasn't performed after the "no such file or
directory" log message, one which perhaps did a partial copy?  (See
above, where *exactly* that concern about using these simple unix
commands was raised, even before getting to this part.)

> $ scp backup:/egheriugherg /tmp/a
> scp: /egheriugherg: No such file or directory
> $ echo $?
> 1

scp has no idea whether it made a copy of a partial file, though.

> Anyway, the slave continued in its retrying loop as expected, except that each time
> it re-fetched the previous segment, D0; this seems a bit peculiar:
>
> 2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Having it go back to D0 is at least a bit interesting.  I wonder if
somehow it was partial.  Have you looked at the files on the archive
server to make sure they're all 16MB and the same size?
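For example, something along these lines on the backup server should
flag any short files (this assumes the default 16MB segment size):

    # Plain WAL segments should all be exactly 16MB (16777216 bytes) with a
    # default build; .history and .backup files are legitimately smaller.
    find backup/postgresql/archivedir -type f ! -size 16777216c -ls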
> (Note the "invalid magic number" was only reported the first time.)

Part of the issue here is that there's no logging happening with these
scp commands, so we don't really know exactly what happened.  I'm
wondering if perhaps the scp had *just* started copying the file from
the primary to the backup server when the scp from the backup server to
the replica started, and it ended up copying a zero-byte file.  Though
if that were the case then the next time around the file should have
been there, unless there was some network hiccup between the primary and
the backup server that caused the copy to fail...

> Eventually I fixed the networking issue that had prevented the replication connection
> to master and it could start streaming:
>
> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> So here is the real problem which worries me: it seems to ask the master for
> segment D0, which is one that it did successfully restore from the backup earlier
> on, and the master - not unexpectedly - reports that it has already been removed.

If D0 has been removed from the primary then it should be on the backup
server, right?  Are you sure it actually is, and that it's a complete
segment?  I'm starting to think that you're in exactly the predicament I
mentioned above: maybe D0 was a partial copy to the backup server, that
copy failed and was returned to PG as failed, but then it tried to run
again and the stupid 'test' command returned success because the file
exists, even though it wasn't complete...

> I tried restarting the slave:
>
> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
> 2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> Note that again it failed to get D1 from the backup, as expected, and then
> reports "invalid magic number" in that file, and again tries to start
> streaming from a segment that the master doesn't have.
>
> Eventually I decided to try and cause the D1 file to exist in the backup:
>
> db=> create table waste_space as select * from large_table;
>
> That caused four new WAL segments to be copied to the backup, and restored
> by the slave:
>
> 2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
> 2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
> 2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
> 2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
> 2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
> 2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1
>
> Note the "unexpected pageaddr" message.  Despite that, the streaming started
> and everything was working correctly.  Hurray!

So, just to be clear, those LOG-level messages aren't generally an
issue, they're just LOGs.  We expect to see them.

> I remain concerned, however, that something bad happened and I'd like to
> understand it.  Can anyone help?

Really hard to tell from this.  On the one hand, it seems like maybe the
primary was done with D0 but hadn't finished with D1 yet, while the
replica hadn't made that transition from D0 -> D1, hence why it was
still asking for D0.
I'm certainly concerned that there might have been some amount of
partial-WAL-file copying happening, but PG is pretty good at dealing
with a variety of odd cases, so I tend to doubt that the replica
actually ended up being corrupted.  Still, I'll again strongly advocate
for using a proven solution which avoids these risks by doing things
like copying into temporary files on the backup server and then using an
atomic 'mv' to make sure that anyone reading from the archive never sees
a partially-written file, and, just to double-check, also calculating a
hash of every WAL file written into the archive and re-verifying that
hash when reading it back out.  That's what we do in pgbackrest, at
least.
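If you do keep the hand-rolled approach in the meantime, the hashing
part can be approximated with standard tools.  Here's a rough sketch of
a verifying restore wrapper; the script path, the "backup" host, and the
".sha256" naming convention are just assumptions for illustration, and
it assumes the archive side wrote "sha256sum $f > $f.sha256" next to
each segment after the atomic rename:

    #!/bin/sh
    # Hypothetical /usr/local/bin/restore_wal.sh, configured as:
    #   restore_command = '/usr/local/bin/restore_wal.sh %f %p'
    set -e
    WAL_FILE="$1"; TARGET="$2"
    ARCHIVE=backup/postgresql/archivedir

    # Fetch the segment and the checksum that was written at archive time.
    scp "backup:$ARCHIVE/$WAL_FILE"        "$TARGET.tmp"
    scp "backup:$ARCHIVE/$WAL_FILE.sha256" "$TARGET.sha256"

    # Verify before letting PostgreSQL see the file.
    GOT=$(sha256sum "$TARGET.tmp" | cut -d' ' -f1)
    WANT=$(cut -d' ' -f1 "$TARGET.sha256")
    [ "$GOT" = "$WANT" ] || { echo "checksum mismatch for $WAL_FILE" >&2; exit 1; }

    rm -f "$TARGET.sha256"
    mv "$TARGET.tmp" "$TARGET"

That still isn't a substitute for a tested tool, but at least it fails
loudly instead of handing a partial or mangled segment to the server.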
Thanks!

Stephen