Greetings, * Yuri Kanivetsky (yuri.kanivetsky@xxxxxxxxx) wrote: > I'm trying to compile a basic set of instruction needed to set up > continuous archiving and to recover from a backup. I'm running > PostgreSQL 9.3 on Debian Stretch system. 9.3 is about to be end-of-life in just another month or so, see: https://www.postgresql.org/support/versioning/ As mentioned, this is an extremely complicated subject and you should really use one of the tools that's been written to do exactly this. Here's a few comments as to why- > Setting up continuous archiving > > * Set up WAL archiving > > * on backup server under postgres user > > * create /var/lib/postgresql/wal_archive dir > > $ mkdir /var/lib/postgresql/wal_archive > > * on database server under postgres user > > * generate ssh key > > $ ssh-keygen -f /var/lib/postgresql/.ssh/id_rsa # providing > path to key file makes it > # to not ask questions > > * add corresponding record to known_hosts file > > $ ssh-keyscan -t rsa BACKUP_SRV >> ~/.ssh/known_hosts > > * locally > > * authorize login from database to backup server > > $ ssh DATABASE_SRV cat ~/.ssh/id_rsa.pub | ssh BACKUP_SRV > 'mkdir --mode 0700 .ssh; cat >> ~/.ssh/authorized_keys; chmod 0600 > .ssh/authorized_keys' > > * on database server under root > > * change postgresql.conf > > wal_level = archive > archive_mode = on > archive_command = 'rsync -a %p > BACKUP_SRV:/var/lib/postgresql/wal_archive/%f' This rsync command does nothing to verify that the WAL file has been persisted to disk on the backup server, which is a problem if the backup server crashes or there's some kind of issue with it after the rsync finishes (you'll end up with gaps in your WAL stream which could prevent you from being able to restore a backup or from being able to do PITR). A good backup tool would also calculate a checksum of the WAL file and store that independently, verify that the WAL file is for the cluster configured (and not for some other cluster because someone mistakenly tried to start archiving two primaries into the same location), verify that the size of the WAL file is what's expected, and probably do a few other checks that I'm not remembering right now, but which tools like pgBackRest do. > * restart PostgreSQL > > # systemctl resart postgresql > > * Make a base backup > > * on database server under root > > * add a line to postgresql.conf > > max_wal_senders = 1 > > * add a line to pg_hba.conf > > host replication replication BACKUP_SRV_IP/BACKUP_SRV_NETMASK trust > > * restart PostgreSQL > > # systemctl restart postgresql > > * on database server under postgres user > > * create replication user > > CREATE USER replication WITH REPLICATION; > > or > > $ createuser --replication replication > > * on backup server under postgres user > > * make base backup > > $ pg_basebackup -h DATABASE_SRV -U replication -D > /var/lib/postgresql/base_backups/$(date +%Y%m%d-%H%M%S) pg_basebackup is pretty good and it'll soon be able to perform page-level checksum validation of the database while doing a backup, assuming checksums have been enabled, but sadly it certainly didn't do that in 9.3. pg_basebackup should ensure that everything is persisted to disk, but it doesn't do anything to protect against latent corruption happening. To do that, an independent manifest of the backup needs to be built which tracks the checksum of every file backed up and then that needs to be checked when performing a restore. > Restoring from a backup > > * under root > > * stop PostgreSQL if running > > # systemctl stop postgresql > > * under postgres user > > * move data dir > > $ mv 9.3{,-$(date +%Y%m%d-%H%M%S)} > > * copy backup > > $ mkdir 9.3 > $ cp -r base_backups/TIMESTAMP 9.3/main > > * copy unarchived segment files > > $ find 9.3-TIMESTAMP/main/pg_xlog -maxdepth 1 -type f -exec cp -t > 9.3/main/pg_xlog {} + This is not something which I'd generally encourage doing.. > * create recovery.conf in 9.3/main > > restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p' This restore command doesn't perform any validation of the WAL file which is being pulled back from the archive. > * under root > > * start PostgreSQL > > # systemctl start postgresql > > A few notes. > > Running out of space on backup server can lead in its turn to database > server running out of space, since WAL segment files stop being > archived and keep piling up. The same might happen when archiving > falls behind. Which also widens the window for data loss. Yes, that's a concern. pgBackRest has an option to allow you to choose if you want to let the system run out of disk space or if you want to throw away WAL (of course, leading to the case where you couldn't perform PITR and so you'd want to do a new full backup as soon as possible, but at least you have the choice). > WAL archiving doesn't track changes to configuration files. No, that's not likely to ever change. > pg_basebackup will back up configuration files only if they are inside > data dir. Right, a number of the backup tools do that. > If database doesn't generate much WAL traffic, there could be a long > delay between completing a transaction and archiving its results > (archive_timeout). Yes, that's a reason to consider setting archive_timeout and then using a tool, like pgBackRest, which will compress the WAL files, avoiding taking up lots of disk space with mostly-empty WAL files. > You might want to prevent users from accessing the database until > you're sure the recovery was successful (pg_hba.conf). Yes, that's certainly an important thing to consider. > I'm not considering here possible issues with tablespaces and other caveats: > > https://www.postgresql.org/docs/9.3/static/continuous-archiving.html#CONTINUOUS-ARCHIVING-CAVEATS You really should be looking to upgrade to a more recent version of PostgreSQL as 9.3 is about to be out of support, and some of those caveats (eg: hash indexes, at least) are no longer an issue on modern versions. > Most importantly, does it makes sense to keep more than one base backup? Absolutely. I'd encourage multiple full backups and then also consider having differential and/or incremental backups as well- one thing to consider is that WAL replay is a single-threaded and not terribly fast process. Having a tool which allows you to do parallel backup/restore and supports incremental and differential backups, in addition to full backups, can get you to a system where restores are able to be performed very quickly, in parallel, with minimal WAL replay time following the initial restore. > Also, does it look any good? Does it make sense to make ~/wal_archive > and ~/base_backups dirs not readable by group and the world? From what > I can see files in ~/wal_archive are 0600, ~/base_backups/TIMESTAMP is > 0700. How can I confirm that it's working properly? Is WAL segments > files appearing in ~/wal_archive enough? PG will soon be able to support either 0600 or 0640 modes for the database directory, to allow unprivileged processes to perform backups, so you might want to consider that. To confirm it's working properly, you might consider having a regular verification performed where you swap a WAL segment in PG and make sure that segment reaches the archive properly, which is exactly what the 'pgbackrest check' command does. Thanks! Stephen
Attachment:
signature.asc
Description: PGP signature