Hi Stephen,
Stephen Frost wrote:
> * Phil Endecott (spam_from_pgsql_lists@xxxxxxxxxxxx) wrote:
> > archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
> > scp %p backup:backup/postgresql/archivedir/%f'
> This is really not a sufficient or particularly intelligent
> archive_command. In general, I'd strongly recommend against trying to
> roll your own archiving/backup solution for PostgreSQL. In particular,
> the above is quite expensive, doesn't address the case where a file is
> only partially copied to the backup server, and doesn't ensure that the
> file is completely written and fsync'd on the backup server, meaning that
> a failure will likely result in WAL going missing. There are much better
> solutions out there which you really should be using instead of trying
> to build your own. In particular, my preference is pgbackrest (though
> I'm one of the folks in its development, to be clear), but there are
> other options such as barman or WAL-E which I believe also address the
> concerns raised above.
Hmmm, well I based it on what I read in the documentation:
https://www.postgresql.org/docs/9.6/static/continuous-archiving.html
The expense is not an issue in this case. I would be surprised if
scp were to exit(0) after only partially copying a file, but perhaps
you know something I don't about that.
I know that rsync creates a temporary file and moves it into place when
all the data has been transferred, which scp doesn't do; perhaps that
would be safer.
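Something along those lines as an archive_command might look like the
sketch below (untested here; it assumes a coreutils sync that will flush
a single named file, and the .tmp suffix is just my choice):

    # Sketch only: pre-check, copy to a temporary name, flush it, then
    # rename atomically so a reader never sees a partially-written file.
    archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
      scp %p backup:backup/postgresql/archivedir/%f.tmp &&
      ssh backup "sync backup/postgresql/archivedir/%f.tmp &&
        mv backup/postgresql/archivedir/%f.tmp backup/postgresql/archivedir/%f"'

(Wrapped here for readability, like the version above.)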
The master was only writing WAL files maybe 3 or 4 times per day at
the time, so any scenario that requires its crash to have occurred
exactly as it was doing an scp involves a substantial coincidence.
I am limited in what I can run on the backup server.
> Having it go back to D0 is at least a bit interesting. I wonder if
> somehow it was partial. Have you looked at the files on the archive
> server to make sure they're all 16 MB and the same size?
Yes, all the files on the backup including D0 were 16 MB. But:
2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
Are the last two log lines above telling us anything useful? Are they
saying that, of the 16 MB (0x1000000-byte) WAL file, it restored only as
far as byte 0xf956c0 or 0xfff088? Is that what we would expect? Is it
trying to use the streaming connection to get the missing bytes from
0xfff088 to 0xffffff? Or is that just an empty gap at the end of the file
due to the next record being too big to fit?
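For reference, the offsets I'm quoting are just those LSNs measured from
the start of the segment, which as I understand it covers 7/D0000000
through 7/D0FFFFFF; e.g.

    db=> select pg_xlog_location_diff('7/D0F956C0', '7/D0000000');  -- 16340672 = 0xf956c0
    db=> select pg_xlog_location_diff('7/D0FFF088', '7/D0000000');  -- 16773256 = 0xfff088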
I then ran:

db=> create table waste_space as select * from large_table;
That caused four new WAL segments to be copied to the backup, and restored
by the slave:
Note again that this was sufficient to make it start working again,
which I think is important.
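Presumably, though I haven't tried it on this setup, forcing a segment
switch directly would have had the same unsticking effect without
writing out a junk table; in 9.6 that would be something like:

    db=> select pg_switch_xlog();  -- completes the current WAL segment so
                                   -- it gets handed to archive_command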
> Really hard to tell from this. On the one hand, it seems like maybe the
> primary was done with D0 but hadn't finished with D1 yet, while the
> replica hadn't made that transition from D0 -> D1, hence why it was
> still asking for D0.
Yes. What is actually supposed to happen? What exactly do you mean by
"make the transition"?
> I'm certainly concerned that there might have been some amount of
> partial-WAL-file copying happening, but PG is pretty good at dealing
> with a variety of odd cases and so I tend to doubt that the replica
> actually ended up being corrupted. Still, I'll again strongly advocate
> for using a proven solution which avoids these risks by doing things
> like copying into temporary files on the backup server and then using
> an atomic 'mv', so that anyone reading from the archive never sees a
> partially-written file, and which, just to double-check, also
> calculates a hash of every WAL file written into the archive and
> re-verifies that hash when reading it back out. That's what we do in
> pgbackrest, at least.
Noted. But I don't see any evidence that that was actually the root
cause here, and it relies on an unlikely timing coincidence. This is
the very first time that I've had a replica recover after a significant
downtime, and it failed.
Regards, Phil.