Greetings,

* Phil Endecott (spam_from_pgsql_lists@xxxxxxxxxxxx) wrote:
> Dear Experts,

Since you're asking ...

> I recently set up replication for the first time.  It seemed to be
> working OK in my initial tests, but didn't cope when the slave was
> down for a longer period.  This is all with the Debian stable
> packages of PostgreSQL 9.6.  My replication setup involves a third
> server, "backup", to and from which WAL files are copied using scp
> (and essentially never deleted), plus streaming replication in
> "hot standby" mode.
>
> On the master, I have:
>
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>  scp %p backup:backup/postgresql/archivedir/%f'

This is really not a sufficient or particularly intelligent
archive_command.  In general, I'd strongly recommend against trying to
roll your own archiving/backup solution for PostgreSQL.  In particular,
the above is quite expensive, doesn't address the case where a file is
only partially copied to the backup server, and doesn't ensure that the
file is completely written and fsync'd on the backup server, meaning
that a failure will likely result in WAL going missing.

There are much better solutions out there which you really should be
using instead of trying to build your own.  My preference is pgbackrest
(though, to be clear, I'm one of the folks involved in its development),
but there are other options such as barman or WAL-E which I believe also
address the concerns raised above.
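Just to make that concern concrete, here is a minimal sketch of the sort
of wrapper a hand-rolled setup would need at the very least (the script
name and paths are made up for illustration, and even this leaves out
retries, checksums and WAL expiry, which I'll come back to below):

    #!/bin/sh
    # Hypothetical wrapper, e.g. saved as /usr/local/bin/archive_wal.sh and
    # configured as:  archive_command = '/usr/local/bin/archive_wal.sh %p %f'
    # (script name and paths are illustrative only)
    set -e
    WAL_PATH="$1"; WAL_FILE="$2"
    ARCHIVE=backup/postgresql/archivedir

    # Never overwrite a segment that already made it into the archive.
    ssh backup "test ! -f $ARCHIVE/$WAL_FILE" || exit 1

    # Copy under a temporary name so that a half-copied segment is never
    # visible under its final name to anything restoring from the archive.
    scp "$WAL_PATH" "backup:$ARCHIVE/$WAL_FILE.tmp"

    # Flush it to disk on the backup server, then rename it into place.
    # (a bare "sync" is blunt but portable)
    ssh backup "sync && mv $ARCHIVE/$WAL_FILE.tmp $ARCHIVE/$WAL_FILE"

Even that doesn't cover everything, which is why I keep pointing people
at the existing tools.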
> On the slave I have:
>
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
>
> hot_standby = on
>
> This all seemed to work OK until the slave was shut down for a few days.
> (During this time the master was also down for a shorter period.)
> When the slave came back up it successfully copied several WAL files
> from the backup server until it reached one that hasn't been created yet:

That all sounds fine.

> 2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
> 2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
> 2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
> 2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
> 2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>
> At this point there was a temporary problem due to a firewall rule that
> blocked the replication connection to the master:
>
> 2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Well, that's certainly not good.

> The slave then entered a loop, retrying to scp the next file from the backup
> archive (and failing) and connecting to the master for streaming replication
> (and also failing).

That is as expected, except for the "invalid magic number":

> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Note that the "invalid magic number" error is just LOG-level, and that
PostgreSQL will certainly ask for WAL files which don't exist in the
archive yet.

> Note that it's complaining about an invalid magic number in log segment D1, which is
> the one that has just failed to scp.  That looked suspicious to me and I remembered
> notes in the docs about ensuring that the archive and/or restore commands return an
> error exit status when they fail; I checked that and yes scp does exit(1) when the
> requested file doesn't exist:

Sure, but scp doesn't produce any output when it's successful, so are
you sure that an scp wasn't performed after the "no such file or
directory" log message, one which perhaps did a partial copy?  (See
above, where *exactly* that concern about using these simple unix
commands was raised, even before getting to this part.)

> $ scp backup:/egheriugherg /tmp/a
> scp: /egheriugherg: No such file or directory
> $ echo $?
> 1

scp has no idea whether it made a copy of a partial file, though.

> Anyway, the slave continued in its retrying loop as expected, except that each time
> it re-fetched the previous segment, D0; this seems a bit peculiar:
>
> 2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly

Having it go back to D0 is at least a bit interesting.  I wonder if
somehow it was partial.  Have you looked at the files on the archive
server to make sure they're all 16MB and the same size?
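For example, something along these lines on the backup server should
flag any short files (this assumes the default 16MB segment size):

    # Plain WAL segments should all be exactly 16MB (16777216 bytes) with a
    # default build; .history and .backup files are legitimately smaller.
    find backup/postgresql/archivedir -type f ! -size 16777216c -ls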
> (Note the "invalid magic number" was only reported the first time.)

Part of the issue here is that there's no logging happening with these
scp commands, so we don't really know exactly what happened.  I'm
wondering if perhaps the scp had *just* started copying the file from
the primary to the backup server when the scp from the backup server to
the replica started, and it ended up copying a zero-byte file.  Though
if that were the case then the next time around the file should have
been there, unless there was some network hiccup between the primary and
the backup server that caused the copy to fail...

> Eventually I fixed the networking issue that had prevented the replication connection
> to master and it could start streaming:
>
> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> So here is the real problem which worries me: it seems to ask the master for
> segment D0, which is one that it did successfully restore from the backup earlier
> on, and the master - not unexpectedly - reports that it has already been removed.

If D0 has been removed from the primary then it should be on the backup
server, right?  Are you sure it actually is, and that it's a complete
segment?  I'm starting to think that you're in exactly the predicament I
mentioned above: maybe D0 was a partial copy to the backup server, that
copy failed and was returned to PG as failed, but then it tried to run
again and the stupid 'test' command returned success because the file
exists, even though it wasn't complete...

> I tried restarting the slave:
>
> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
> 2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> Note that again it failed to get D1 from the backup, as expected, and then
> reports "invalid magic number" in that file, and again tries to start
> streaming from a segment that the master doesn't have.
>
> Eventually I decided to try and cause the D1 file to exist in the backup:
>
> db=> create table waste_space as select * from large_table;
>
> That caused four new WAL segments to be copied to the backup, and restored
> by the slave:
>
> 2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
> 2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
> 2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
> 2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
> 2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
> 2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1
>
> Note the "unexpected pageaddr" message.  Despite that, the streaming started
> and everything was working correctly.  Hurray!

So, just to be clear, those LOG-level messages aren't generally an
issue, they're just LOGs.  We expect to see them.

> I remain concerned, however, that something bad happened and I'd like to
> understand it.  Can anyone help?

Really hard to tell from this.  On the one hand, it seems like maybe the
primary was done with D0 but hadn't finished with D1 yet, while the
replica hadn't made that transition from D0 -> D1, hence why it was
still asking for D0.
I'm certainly concerned that there might have been some amount of
partial-WAL-file copying happening, but PG is pretty good at dealing
with a variety of odd cases, so I tend to doubt that the replica
actually ended up being corrupted.  Still, I'll again strongly advocate
for using a proven solution which avoids these risks by doing things
like copying into temporary files on the backup server and then using an
atomic 'mv' to make sure that anyone reading from the archive never sees
a partially-written file, and, just to double-check, also calculating a
hash of every WAL file written into the archive and re-verifying that
hash when reading it back out.  That's what we do in pgbackrest, at
least.
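If you do keep the hand-rolled approach in the meantime, the hashing
part can be approximated with standard tools.  Here's a rough sketch of
a verifying restore wrapper; the script path, the "backup" host, and the
".sha256" naming convention are just assumptions for illustration, and
it assumes the archive side wrote "sha256sum $f > $f.sha256" next to
each segment after the atomic rename:

    #!/bin/sh
    # Hypothetical /usr/local/bin/restore_wal.sh, configured as:
    #   restore_command = '/usr/local/bin/restore_wal.sh %f %p'
    set -e
    WAL_FILE="$1"; TARGET="$2"
    ARCHIVE=backup/postgresql/archivedir

    # Fetch the segment and the checksum that was written at archive time.
    scp "backup:$ARCHIVE/$WAL_FILE"        "$TARGET.tmp"
    scp "backup:$ARCHIVE/$WAL_FILE.sha256" "$TARGET.sha256"

    # Verify before letting PostgreSQL see the file.
    GOT=$(sha256sum "$TARGET.tmp" | cut -d' ' -f1)
    WANT=$(cut -d' ' -f1 "$TARGET.sha256")
    [ "$GOT" = "$WANT" ] || { echo "checksum mismatch for $WAL_FILE" >&2; exit 1; }

    rm -f "$TARGET.sha256"
    mv "$TARGET.tmp" "$TARGET"

That still isn't a substitute for a tested tool, but at least it fails
loudly instead of handing a partial or mangled segment to the server.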
Thanks!

Stephen