Greetings, * Phil Endecott (spam_from_pgsql_lists@xxxxxxxxxxxx) wrote: > Stephen Frost wrote: > >* Phil Endecott (spam_from_pgsql_lists@xxxxxxxxxxxx) wrote: > >>archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f && > >> scp %p backup:backup/postgresql/archivedir/%f' > > > >This is really not a sufficient or particularly intelligent > >archive_command. In general, I'd strongly recommend against trying to > >roll your own archiving/backup solution for PostgreSQL. In particular, > >the above is quite expensive, doesn't address the case where a file is > >only partially copied to the backup server, and doesn't ensure that the > >file is completely written and fsync'd on the backup server meaning that > >a failure will likely result in WAL going missing. There's much better > >solutions out there and which you really should be using instead of > >trying to build your own. In particular, my preference is pgbackrest > >(though I'm one of the folks in its development, to be clear), but > >there's other options such as barman or WAL-E which I believe also > >address the concerns raised above. > > Hmmm, well I based it on what I read in the documentation: > > https://www.postgresql.org/docs/9.6/static/continuous-archiving.html > > The expense is not an issue in this case. I would be surprised if > scp were to exit(0) after only partially copying a file but perhaps > you know something I don't about that. The scp might not, but so what? If a partial file remains on the far side, then a subsequent invokation of that archive_command will come back saying that the file is there. Further, the file isn't going to actually be sync'd to the filesystem on the far side, so even if the scp returns success, a failure on the backup server could result in loss of the WAL file. > I know that rsync creates a temporary file and moves it into place when > all the data has been transferred, which scp doesn't do; perhaps that > would be safer. No, what would be better is using a tool which will actually fsync() the file on the backup server after writing it out there, making sure it's been stored to the filesystem, and then to perform an atomic mv operation to put it in place, and then to fsync() the directory to ensure that change has also been sync'd to disk. Even better is to calculate a hash of the file and to reverify that on restore, to make sure that it doesn't end up getting corrupted on the backup server somehow. > The master was only writing WAL files maybe 3 or 4 times per day at > the time, so any scenario that requires its crash to have occurred > exactly as it was doing an scp involves a substantial coincidence. While I get that it's unlikely to be relevant to this particular case, I wouldn't recommend depending on never having that coincidence happen. What I've seen quite a bit of, just so you know, is that things tend to all break down at about the same time- so right when something bad is happening on the primary, the backup server runs into issues too. This might be due to networking breaking down in a particular area all at once or shared storage having some issue but it's happened often enough that relying on uncommon things to avoid each other is something I've got a hard time doing these days. > I am limited in what I can run on the backup server. You could consider backing up locally using a tool which is built for PostgreSQL and then sync'ing the results of that backup over to the backup server, perhaps. Or point out that you need a backup server where you don't have those restirctions in order to properly have backups. Or use an external service like s3 (or a clone) to back up to. > >Having it go back to D0 is at least a bit interesting. I wonder if > >somehow it was partial.. Have you looked at the files on the archive > >server to make sure they're all 16MB and the same size? > > Yes, all the files on the backup including D0 were 16 MB. But: Ok. Did you see any failures on the sending side when archive_command was being run? PG shouldn't start trying to archive the file until it's done with it, of course, so unless something failed right at the end of the scp and left the file at 16MB but it wasn't actually filled.. > >>2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive > >>2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0 > >>2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088 > > Are the last two log lines above telling us anything useful? Is that > saying that, of the 16 MB (0x1000000 byte) WAL file it restored only as > far as byte 0xf956c0 or 0xfff088? Is that what we would expect? Is it > trying to use the streaming connection to get the missing bytes from > FFFF088 to FFFFFFFF? Is that just an empty gap at the end of the file > due to the next record being too big to fit? The short answer is that, yes, the next record was likely too big to fit, but that's what the replica was trying to figure out and couldn't because D0 was gone from the primary already. One thing which should help this would be to use physical replication slots on the primary, which would keep it from throwing away WAL files until it knows the replica has them, but that runs the risk of ending up with lots of extra WAL on the primary if the replica is gone for a while. If you'd prefer to avoid that then having wal_keep_segments set to '1' would avoid this particular issue as well, I'd expect. I do wonder if perhaps we should just default to having it as '1' to avoid exactly this case, as it seems like perhaps PG archived D0 and then flipped to D1 and got rid of D0, which is all pretty reasonable, except that a replica trying to catch up is going to end up asking for D0 from the primary because it didn't know if there was anything else that should have been in D0.. > >>db=> create table waste_space as select * from large_table; > >> > >>That caused four new WAL segments to be copied to the backup, and > >>restored by the slave: > > Note again that this was sufficient to make it start working again, > which I think is important. I agree that it's certainly interesting though presumably what ends up happening is that the replica finally gets to D1 and is then able to connect to the primary before the primary has finished with D1 and it's able to start tracking along with the primary. > >Really hard to tell from this. On the one hand, it seems like maybe the > >primary was done with D0, but hadn't finished with D1 yet, but the > >replica hadn't made that transistion from D0 -> D1, hence why it was > >still asking for D0. > > Yes. What is actually supposed to happen? What exactly do you mean by > "make the transition"? Simply moving along the WAL stream to get to the D1 WAL file. This all more-or-less looks like what's "supposed to happen" except that the lack of additional writes to the primary meant that things appeared to get "stuck" here, and then got "unstuck" once you created some data. Looks like this might be a bit of an edge case that we aren't handling very cleanly. > >I'm certainly concerned that there might have > >been some amount of partial-WAL-file copying happening, but PG is pretty > >good at dealing with a variety of odd cases and so I tend to doubt that > >the replica actually ended up being corrupted, but I'll again strongly > >advocate for using a proven solution which avoids these risks by doing > >things like copying into temporary files on the backup server and then > >using an atomic 'mv' to make sure that anyone reading from the archive > >wouldn't ever see a partially-written file, and, just to double-check, > >also calculate a hash of every WAL file written into the archive and > >re-verify that hash when reading it back out. That's what we do in > >pgbackrest, at least. > > Noted. But I don't see any evidence that that was actually the root > cause here, and it relies on an unlikely timing coincidence. This is > the very first time that I've had a replica recover after a significant > downtime, and it failed. Failed is really over-stating things, isn't it? The replica caught all the way up to D0 just fine, but it just couldn't get past that because the primary had removed D0 already, and the replica couldn't actually reconnect to the primary to start streaming again, but it's not like it would have been all *that* far behind the primary since the primary was still working on D1, and once the primary did write a few things, the replica was able to start streaming again. If this can be reproduced and then a fix put in place for it, I'm all for that, to be clear, but I don't see this as a particularly big issue or a systemic problem with the way WAL replay and replication work. Thanks! Stephen
Attachment:
signature.asc
Description: PGP signature