Greetings,

* Lucas (root@xxxxxxx) wrote:

On 27/10/2021, at 8:35 AM, Stephen Frost <sfrost@xxxxxxxxxxx> wrote:

I do want to again stress that I don't recommend writing your own tools for doing backup/restore/PITR, and I would caution people against trying to use the approach you've suggested. Also, being able to tell when such a process *doesn't* work is non-trivial (look at how long it took us to discover the issues around fsync..), so saying that it seems to have worked for a long time for you isn't really enough to make me feel comfortable with it.
I think it is worth mentioning that we’re using PG 9.2, so I don’t benefit from features like logical replication to help with backups and such. (Yes, I know it’s old and that I should upgrade ASAP. We’re working on it)
You'll certainly want to upgrade, of course. That said, logical replication isn't for backups, so not sure how that's relevant here.
Agreed. Logical replication isn’t a backup. That’s why we take snapshots and ship the WAL files to an S3 bucket. The snapshots are done this way:

1. Grab the latest applied WAL file for later reference and store it in a Bash variable
2. Stop the Postgres process
How is PostgreSQL stopped here..?
systemctl stop postgresql

3. Check it is stopped
4. Start the block-level EBS snapshot process
5. Apply tags to the snapshot, such as the WAL file from step 1
6. Wait for the snapshot to complete, querying the AWS API
7. Start PG
8. Check it is started
9. Check it is replicating from the master
This doesn't explain how the snapshot is restored. That's a huge part of making sure that any process is implemented properly.
The snapshot is restored by creating an EBS volume from the snapshot and attaching it to a brand-new instance. Then recovery.conf restores the database to the desired state. The entire process takes around 10 minutes to complete.
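For reference, a minimal awscli sketch of that create-volume/attach step could look roughly like the following (the snapshot ID, instance ID, availability zone and device name are placeholders, not the actual values used here):

    # Create a volume from the chosen snapshot and attach it to the new instance
    VOL_ID=$(aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 \
        --availability-zone us-east-1a --volume-type gp3 \
        --query 'VolumeId' --output text)
    aws ec2 wait volume-available --volume-ids "$VOL_ID"
    aws ec2 attach-volume --volume-id "$VOL_ID" \
        --instance-id i-0123456789abcdef0 --device /dev/xvdf
    # Then mount it at the data directory mount point before starting PG
    mount /dev/xvdf /pgsql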
Having the database offline for 10 minutes is a luxury that many don't have; I'm a bit surprised that it's not an issue here, but if it isn't, then that's great.
I’m doing this on a slave server, so my application and my databases are still serving customers. No downtime there. Only the slave instance that performs the snapshots goes down.

As noted down-thread though, if you stop PostgreSQL cleanly in step #2 (that is, either 'fast' or 'smart' shutdown modes are used) then you don't need to do anything else; a 'shutdown' checkpoint will have been written into the WAL that is included in the snapshot and everything should be good. Note that this presumes that the snapshot above includes the entire PG data directory, all tablespaces, and the WAL. If that's not the case then more discussion would be needed.

I copy the WAL files to S3 in case I ever need to restore an old snapshot (we keep snapshots for 30 days), which won’t have the WAL files in the volume itself. Also for safety reasons, after reading this article many, many times: https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/
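On the clean-shutdown point above: one way to verify, on a volume restored from such a snapshot, that the cluster really was shut down cleanly is pg_controldata (a sketch, assuming the Ubuntu package paths and the 9.2 data directory layout described later in this message):

    # "Database cluster state" should read "shut down" for a clean offline snapshot
    /usr/lib/postgresql/9.2/bin/pg_controldata /pgsql/9.2/main | grep 'Database cluster state'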
If the database has been cleanly shut down and the pg_wal (pg_xlog) directory of the running database is included in that snapshot, then you don't need any other WAL in order to be able to restore an individual snapshot. If you want to be able to perform PITR (Point-In-Time Recovery, where you can play forward through the WAL to any point in time you want to restore to, without having to pick one of the snapshots), then you would want to archive *all* of your WAL using archive_command and a tool like pgbackrest.

We’re working on a migration to PG 14 so, please, feel free to suggest another best way to achieve this, but it would be better in a new thread I reckon.
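For anyone wanting to follow that archive_command suggestion on a setup like this one, the postgresql.conf settings involved look roughly like the following (the bucket name is only a placeholder):

    # postgresql.conf on the master -- 9.2-era parameter names
    wal_level = hot_standby        # already needed for the read-only slaves; 'replica' on 9.6+
    archive_mode = on
    # ship every completed WAL segment somewhere durable; %p/%f are expanded by PG
    archive_command = 'aws s3 cp %p s3://example-wal-archive/%f'

A tool such as pgbackrest supplies its own archive_command (its archive-push command) and also handles retention and restores, which is generally preferable to a hand-rolled copy command.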
It isn't terribly clear what you're actually looking for in terms of capability, so it's hard to suggest something to "achieve this". Taking snapshots while the database is down, and then talking about copying WAL files to S3 in order to restore those snapshots, doesn't strike me as making much sense, and I'd worry about other misunderstandings around how PG writes data, how WAL is handled, and how to properly perform PG backups (online and offline) and restores. I'm afraid that this is a quite complicated area, and reading blog posts about database outage postmortems (or even our own documentation, frankly) is rarely enough to become an expert in writing tools for that purpose.

Thanks,

Stephen
Let me try to explain my setup better. Maybe that will make the process clearer.
I have a cluster of 1 master and 5 slaves (with streaming replication), deployed on EC2 instances in AWS. These instances are running Ubuntu 16 with PostgreSQL 9.2, and a migration to PG 14 is already underway using Bucardo. We’re testing our application against PG 14 at the moment.

The master and the read-only slaves are r4.4xlarge; the 5th slave, used for backups, is an r4.large.

The master takes all the writes, while four slaves each serve 25% of the read traffic. The 5th slave only exists for the EBS snapshots and does not serve any read traffic. The read distribution is done via Route53 weighted records. When we migrate to PG 14, we’ll also deploy two PGPool instances (two for high availability) and get rid of the Route53 weighted records.
The way our PG instances are set up is:
- / = the root volume
- /pgsql = the PG data directory (i.e. /pgsql/9.2/main)
  - 2TB EBS gp3 volume with 8000 IOPS and 440 MB/s of throughput
- /data = WAL files and PostgreSQL logs
  - The /pgsql/9.2/main/pg_xlog folder is actually located in /data/wal_files; there is a symlink (a sketch of that layout follows this list)
  - PG writes its logs to /data/postgresql/logs instead of /var/log/postgres
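For anyone unfamiliar with that pg_xlog symlink arrangement, this is roughly how such a relocation is done (PG must be stopped first; the paths are the ones from the list above):

    # with PostgreSQL stopped, relocate pg_xlog onto the /data volume and symlink it back
    systemctl stop postgresql
    mkdir -p /data/wal_files
    mv /pgsql/9.2/main/pg_xlog/* /data/wal_files/
    rmdir /pgsql/9.2/main/pg_xlog
    ln -s /data/wal_files /pgsql/9.2/main/pg_xlog
    systemctl start postgresql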
The way we perform snapshots (a condensed sketch of the script follows this list):
- On the 5th slave, we take note of the last applied WAL file
- We stop PG with systemctl stop postgresql
- We create the snapshot using awscli
- We add custom tags to the snapshot, such as the WAL file from Step 1
- We wait until the snapshot is completed, querying the AWS API
- Once the snapshot is completed, we start postgres with systemctl start postgresql
- Our script checks if PG is actually started, by running some queries. Otherwise we get pinged on Slack and PagerDuty.
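Put together, a condensed sketch of that script could look roughly like this (the volume ID is a placeholder, and the real script has more error handling plus the Slack/PagerDuty notifications):

    #!/bin/bash
    # Condensed sketch -- the volume ID is a placeholder, not the real value
    set -euo pipefail
    PGDATA=/pgsql/9.2/main
    DATA_VOL=vol-0123456789abcdef0

    # 1. Note the most recent WAL segment present on the standby (one way of doing it)
    WAL=$(ls "$PGDATA"/pg_xlog/ | grep -E '^[0-9A-F]{24}$' | sort | tail -1)

    # 2-3. Stop PostgreSQL and make sure it is down
    systemctl stop postgresql
    if systemctl is-active --quiet postgresql; then echo "PG still running" >&2; exit 1; fi

    # 4-5. Take the EBS snapshot and tag it with the WAL segment noted in step 1
    SNAP=$(aws ec2 create-snapshot --volume-id "$DATA_VOL" \
        --description "PG backup slave snapshot" --query 'SnapshotId' --output text)
    aws ec2 create-tags --resources "$SNAP" --tags "Key=LastWALFile,Value=$WAL"

    # 6. Wait for the snapshot to finish
    aws ec2 wait snapshot-completed --snapshot-ids "$SNAP"

    # 7-9. Start PG again and check it is up and replicating
    systemctl start postgresql
    sleep 10
    sudo -u postgres psql -Atc 'SELECT pg_is_in_recovery()' | grep -qx t \
        || echo "standby is not replicating"   # notify Slack/PagerDuty here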
The way we restore a snapshot is:
- We modify our CloudFormation templates with the relevant information, for example: stack name, EC2 instance name, EC2 instance type, etc.
- We launch the new CloudFormation stack - it will search for the latest snapshot using tags, and then it will remember the WAL file tag for later
- CFN creates the new stack; there is a lot involved here, but basically it will:
  - Search for the latest snapshot created by the slave
  - Create the instance and its volumes
  - Get the WAL file tag from the snapshot, locate that file in S3, and download the 50 WAL files prior to it plus everything after it
  - Once all WAL files have been downloaded, modify recovery.conf with the relevant information (a sketch of that file follows this list)
  - Start PG
- At this point, we have a new PG slave applying WAL files to catch up with the master; once WAL recovery completes, it automatically switches to streaming replication from the master.
- Then, we have a script that will notify us when the server is ready. The server will automatically register with our Zabbix server and New Relic, for monitoring.
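To illustrate the recovery.conf step above, the file that gets written would look something like this on 9.2 (the restore path, hostname and replication user are placeholders, not necessarily what the template writes):

    # recovery.conf written by the CFN step above (placeholder values)
    standby_mode = 'on'
    # replay the WAL segments that were downloaded from S3...
    restore_command = 'cp /data/wal_restore/%f "%p"'
    # ...then connect to the master and continue via streaming replication
    primary_conninfo = 'host=master.example.internal port=5432 user=replicator'

One thing worth noting for the PG 14 migration: recovery.conf no longer exists on PG 12 and later; the same settings go into postgresql.conf (or postgresql.auto.conf), and a standby.signal file in the data directory tells the server to start as a standby.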
So, basically, we have the whole process automated, and I am constantly restoring the snapshots to test the backup. I do this once a month, at least.
This has worked very well for more than 5 years; I have never had any issues. The downside is that I have to keep an EC2 instance with its volumes just for backups, which costs around USD 400 per month.