Re: Setting up continuous archiving

Stephen Frost <sfrost@xxxxxxxxxxx> · Wed, 26 Sep 2018 10:46:10 -0400

Greetings,

* Yuri Kanivetsky (yuri.kanivetsky@xxxxxxxxx) wrote:
> I'm trying to compile a basic set of instruction needed to set up
> continuous archiving and to recover from a backup. I'm running
> PostgreSQL 9.3 on Debian Stretch system.

9.3 is about to be end-of-life in just another month or so, see:

https://www.postgresql.org/support/versioning/

As mentioned, this is an extremely complicated subject and you should
really use one of the tools that's been written to do exactly this.
Here's a few comments as to why-

> Setting up continuous archiving
> 
> * Set up WAL archiving
> 
>     * on backup server under postgres user
> 
>         * create /var/lib/postgresql/wal_archive dir
> 
>         $ mkdir /var/lib/postgresql/wal_archive
> 
>     * on database server under postgres user
> 
>         * generate ssh key
> 
>         $ ssh-keygen -f /var/lib/postgresql/.ssh/id_rsa   # providing
> path to key file makes it
>                                                           # to not ask questions
> 
>         * add corresponding record to known_hosts file
> 
>         $ ssh-keyscan -t rsa BACKUP_SRV >> ~/.ssh/known_hosts
> 
>     * locally
> 
>         * authorize login from database to backup server
> 
>         $ ssh DATABASE_SRV cat ~/.ssh/id_rsa.pub | ssh BACKUP_SRV
> 'mkdir --mode 0700 .ssh; cat >> ~/.ssh/authorized_keys; chmod 0600
> .ssh/authorized_keys'
> 
>     * on database server under root
> 
>         * change postgresql.conf
> 
>         wal_level = archive
>         archive_mode = on
>         archive_command = 'rsync -a %p
> BACKUP_SRV:/var/lib/postgresql/wal_archive/%f'

This rsync command does nothing to verify that the WAL file has been
persisted to disk on the backup server, which is a problem if the backup
server crashes or there's some kind of issue with it after the rsync
finishes (you'll end up with gaps in your WAL stream which could prevent
you from being able to restore a backup or from being able to do PITR).

A good backup tool would also calculate a checksum of the WAL file and
store that independently, verify that the WAL file is for the cluster
configured (and not for some other cluster because someone mistakenly
tried to start archiving two primaries into the same location), verify
that the size of the WAL file is what's expected, and probably do a few
other checks that I'm not remembering right now, but which tools like
pgBackRest do.

>         * restart PostgreSQL
> 
>         # systemctl resart postgresql
> 
> * Make a base backup
> 
>     * on database server under root
> 
>         * add a line to postgresql.conf
> 
>         max_wal_senders = 1
> 
>         * add a line to pg_hba.conf
> 
>         host  replication  replication  BACKUP_SRV_IP/BACKUP_SRV_NETMASK  trust
> 
>         * restart PostgreSQL
> 
>         # systemctl restart postgresql
> 
>     * on database server under postgres user
> 
>         * create replication user
> 
>         CREATE USER replication WITH REPLICATION;
> 
>         or
> 
>         $ createuser --replication replication
> 
>     * on backup server under postgres user
> 
>         * make base backup
> 
>         $ pg_basebackup -h DATABASE_SRV -U replication -D
> /var/lib/postgresql/base_backups/$(date +%Y%m%d-%H%M%S)

pg_basebackup is pretty good and it'll soon be able to perform
page-level checksum validation of the database while doing a backup,
assuming checksums have been enabled, but sadly it certainly didn't do
that in 9.3.  pg_basebackup should ensure that everything is persisted
to disk, but it doesn't do anything to protect against latent corruption
happening.  To do that, an independent manifest of the backup needs to
be built which tracks the checksum of every file backed up and then that
needs to be checked when performing a restore.

> Restoring from a backup
> 
> * under root
> 
>     * stop PostgreSQL if running
> 
>     # systemctl stop postgresql
> 
> * under postgres user
> 
>     * move data dir
> 
>     $ mv 9.3{,-$(date +%Y%m%d-%H%M%S)}
> 
>     * copy backup
> 
>     $ mkdir 9.3
>     $ cp -r base_backups/TIMESTAMP 9.3/main
> 
>     * copy unarchived segment files
> 
>     $ find 9.3-TIMESTAMP/main/pg_xlog -maxdepth 1 -type f -exec cp -t
> 9.3/main/pg_xlog {} +

This is not something which I'd generally encourage doing..

>     * create recovery.conf in 9.3/main
> 
>     restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'

This restore command doesn't perform any validation of the WAL file
which is being pulled back from the archive.

> * under root
> 
>     * start PostgreSQL
> 
>     # systemctl start postgresql
> 
> A few notes.
> 
> Running out of space on backup server can lead in its turn to database
> server running out of space, since WAL segment files stop being
> archived and keep piling up. The same might happen when archiving
> falls behind. Which also widens the window for data loss.

Yes, that's a concern.  pgBackRest has an option to allow you to choose
if you want to let the system run out of disk space or if you want to
throw away WAL (of course, leading to the case where you couldn't
perform PITR and so you'd want to do a new full backup as soon as
possible, but at least you have the choice).

> WAL archiving doesn't track changes to configuration files.

No, that's not likely to ever change.

> pg_basebackup will back up configuration files only if they are inside
> data dir.

Right, a number of the backup tools do that.

> If database doesn't generate much WAL traffic, there could be a long
> delay between completing a transaction and archiving its results
> (archive_timeout).

Yes, that's a reason to consider setting archive_timeout and then using
a tool, like pgBackRest, which will compress the WAL files, avoiding
taking up lots of disk space with mostly-empty WAL files.

> You might want to prevent users from accessing the database until
> you're sure the recovery was successful (pg_hba.conf).

Yes, that's certainly an important thing to consider.

> I'm not considering here possible issues with tablespaces and other caveats:
> 
> https://www.postgresql.org/docs/9.3/static/continuous-archiving.html#CONTINUOUS-ARCHIVING-CAVEATS

You really should be looking to upgrade to a more recent version of
PostgreSQL as 9.3 is about to be out of support, and some of those
caveats (eg: hash indexes, at least) are no longer an issue on modern
versions.

> Most importantly, does it makes sense to keep more than one base backup?

Absolutely.  I'd encourage multiple full backups and then also consider
having differential and/or incremental backups as well- one thing to
consider is that WAL replay is a single-threaded and not terribly fast
process.  Having a tool which allows you to do parallel backup/restore
and supports incremental and differential backups, in addition to full
backups, can get you to a system where restores are able to be performed
very quickly, in parallel, with minimal WAL replay time following the
initial restore.

> Also, does it look any good? Does it make sense to make ~/wal_archive
> and ~/base_backups dirs not readable by group and the world? From what
> I can see files in ~/wal_archive are 0600, ~/base_backups/TIMESTAMP is
> 0700. How can I confirm that it's working properly? Is WAL segments
> files appearing in ~/wal_archive enough?

PG will soon be able to support either 0600 or 0640 modes for the
database directory, to allow unprivileged processes to perform backups,
so you might want to consider that.

To confirm it's working properly, you might consider having a regular
verification performed where you swap a WAL segment in PG and make sure
that segment reaches the archive properly, which is exactly what the
'pgbackrest check' command does.

Thanks!

Stephen
Attachment:
signature.asc

Description: PGP signature