Greetings,

* Lucas (root@xxxxxxx) wrote:
> > On 2/11/2021, at 6:00 AM, Stephen Frost <sfrost@xxxxxxxxxxx> wrote:
> > * Lucas (root@xxxxxxx) wrote:
> >> The snapshots are done this way:
> >> 1. Grab the latest applied WAL file for later reference, storing it in a Bash variable
> >> 2. Stop the Postgres process
> >
> > How is PostgreSQL stopped here..?
>
> systemctl stop postgresql

Ok, so that's a clean shutdown of PG then.

> >> 3. Check it is stopped
> >> 4. Start the block-level EBS snapshot process
> >> 5. Apply tags to the snapshot, such as the WAL file from step #1
> >> 6. Wait for the snapshot to complete, querying the AWS API for that
> >> 7. Start PG
> >> 8. Check it is started
> >> 9. Check it is replicating from the master
> >
> > This doesn't explain how the snapshot is restored.  That's a huge part
> > of making sure that any process is implemented properly.
>
> The snapshot is restored by creating an EBS volume from the snapshot and
> attaching it to the brand-new instance. Then, recovery.conf will restore
> the database to the desired state.

Ok ... but you don't need a recovery.conf at that point, you can just
start up PG, since it was shut down cleanly.  That said, perhaps you'd
want one for other reasons, such as PITR or to bring it up as a replica.

> >> The entire process takes around 10 minutes to complete.
> >
> > Having the database offline for 10 minutes is a luxury that many don't
> > have; I'm a bit surprised that it's not an issue here, but if it isn't,
> > then that's great.
>
> I'm doing this on a slave server, so my application and my databases are
> still serving customers. No downtime there. Only the slave instance that
> performs the snapshots goes down.

Ah, sure, that's generally alright, though you don't actually need to
take PG down to perform a backup: you can still perform a backup from a
replica while the replica is running, if you use pg_basebackup or
something like pgbackrest which supports backing up from a replica.

That said, there isn't anything particularly wrong with using a replica
for this, except that you won't have quite the same stats or other
information that differs from the primary, which may not be great for
you (the same is true if you use pg_basebackup to back up from a
replica, though with pgbackrest the important but small items are copied
from the primary while the rest is pulled from the replica).

> > As noted down-thread though, if you stop PostgreSQL cleanly in step #2
> > (that is, either 'fast' or 'smart' shutdown modes are used) then you
> > don't need to do anything else; a 'shutdown' checkpoint will have been
> > written into the WAL that is included in the snapshot and everything
> > should be good.  Note that this presumes that the snapshot above
> > includes the entire PG data directory, all tablespaces, and the WAL.
> > If that's not the case then more discussion would be needed.
> >
> >> I copy the WAL files to S3 in case I ever need to restore an old
> >> snapshot (we keep snapshots for 30 days), which won't have the WAL
> >> files in the volume itself. Also for safety reasons, after reading this
> >> <https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/>
> >> article many, many times.
> >
> > If the database has been cleanly shut down and the pg_wal (pg_xlog)
> > directory of the running database is included in that snapshot then you
> > don't need any other WAL in order to be able to restore an individual
> > snapshot.
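(As an aside: an easy way to confirm, before starting PG on a restored
volume, that the snapshot really did come from a clean shutdown is
pg_controldata.  A quick sketch, assuming the Debian-style binary path
for your 9.2 install; since you snapshot a replica, the expected state
is 'shut down in recovery' rather than plain 'shut down':

    $ /usr/lib/postgresql/9.2/bin/pg_controldata /pgsql/9.2/main | grep 'cluster state'
    Database cluster state:               shut down in recovery

Anything like 'in production' or 'in archive recovery' there would mean
the snapshot wasn't taken from a cleanly stopped cluster.)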
> > If you want to be able to perform PITR
> > (Point-In-Time-Recovery, which means you can play forward through the
> > WAL to any point in time that you want to restore to, without having to
> > pick one of the snapshots) then you would want to archive *all* of your
> > WAL using archive_command and a tool like pgbackrest.
> >
> >> We're working on a migration to PG 14 so, please, feel free to suggest
> >> another best way to achieve this, but it would be better in a new
> >> thread I reckon.
> >
> > It isn't terribly clear what you're actually looking for in terms of
> > capability and so it's kind of hard to suggest something to "achieve
> > this".  Having snapshots taken while the database is down and then
> > talking about copying WAL files to s3 to restore those snapshots doesn't
> > strike me as making much sense though, and I'd worry about other
> > misunderstandings around how PG writes data, how WAL is handled, and
> > how to properly perform PG backups (online and offline) and restores.
> > I'm afraid that this is a quite complicated area, and reading blog posts
> > about database outage postmortems (or even our own documentation,
> > frankly) is rarely enough to be able to become an expert in writing
> > tools for that purpose.
>
> Let me try to explain my setup better. Maybe you guys will understand the
> process.
>
> I have a cluster of 1 master and 5 slaves (with streaming replication),
> deployed on EC2 instances in AWS. These instances are running Ubuntu 16
> with PostgreSQL 9.2, and a migration to PG 14 is already in place using
> Bucardo. We're testing our application against PG 14 at the moment.
>
> The master and the read-only slaves are r4.4xlarge. The 5th slave, used
> for backups, is an r4.large.
>
> The master takes all the writes, while 4 slaves serve the reads equally:
> 25% of the read traffic for each slave. The 5th slave only exists for the
> EBS snapshots; it does not serve any read traffic.
> This is done via Route53 weighted records. When we migrate to PG 14,
> we'll also deploy two PGPool instances (two for high availability) and
> will get rid of the Route53 weighted records.
>
> The way our PG instances are set up is:
> / = the root volume
> /pgsql = the PG data directory (i.e. /pgsql/9.2/main)
>   (a 2TB EBS gp3 volume with 8000 IOPS and 440 MB/s of throughput)
> /data = WAL files and PostgreSQL logs
>   (the /pgsql/9.2/main/pg_xlog folder is located in /data/wal_files -
>   there is a symlink - and PG writes its logs to /data/postgresql/logs
>   instead of /var/log/postgres)
>
> The way we perform snapshots:
> 1. On the 5th slave, we take note of the last applied WAL file
> 2. We stop PG with systemctl stop postgresql
> 3. We create the snapshot using awscli

Presumably you snapshot all of the volumes referenced above as part of
this step..?

> 4. We add custom tags to the snapshot, such as the WAL file from step 1
> 5. We wait until the snapshot is completed, querying the AWS API
> 6. Once the snapshot is completed, we start Postgres with systemctl
>    start postgresql
> 7. Our script checks that PG actually started, by running some queries;
>    otherwise we get pinged on Slack and PagerDuty.
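FWIW, steps 3-5 there map pretty directly onto awscli; roughly something
like this (just a sketch -- the $DATA_VOLUME_ID and $LAST_WAL variables
and the tag key are stand-ins for whatever your script actually uses,
and you'd repeat the create-snapshot for each volume):

    systemctl stop postgresql

    # snapshot the data volume, tagging it with the last-applied WAL file
    SNAP_ID=$(aws ec2 create-snapshot \
        --volume-id "$DATA_VOLUME_ID" \
        --tag-specifications \
            "ResourceType=snapshot,Tags=[{Key=last_wal,Value=$LAST_WAL}]" \
        --query SnapshotId --output text)

    # block until the snapshot has actually completed
    aws ec2 wait snapshot-completed --snapshot-ids "$SNAP_ID"

    systemctl start postgresql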
> The way we restore a snapshot is:
> We modify our CloudFormation templates with some relevant information,
> for example: stack name, EC2 instance name, EC2 instance type, etc.
> We launch the new CloudFormation stack - it will search for the latest
> snapshot using tags, and it will remember the WAL file tag for later
> CFN creates the new stack. There is a lot involved here, but basically
> it will:
>   Search for the latest snapshot created by the slave
>   Create the instance and its volumes
>   Get the WAL file snapshot tag, search for it in S3 and download the 50
>   WAL files prior to it and all of those after it
>   Once all WAL files have been downloaded, modify the recovery.conf with
>   relevant information

What does "relevant information" here actually mean though..?  For a
restore where the database was snapshotted after it was shut down
cleanly, as seems to be the case here, you don't need a recovery.conf.

>   Start PG
> At this point, we have a new PG slave applying WAL files to catch up
> with the master, and it will essentially stream-replicate from the
> master automatically, once the recovery of the WAL files has been
> completed.

So, this is for building out a new replica (in which case you'd need a
recovery.conf for it, sure).  Note that PG14 is quite different in how
you configure things for a new replica.

> Then, we have a script that will notify us when the server is ready. The
> server will automatically register with our Zabbix server and New Relic,
> for monitoring.
>
> So, basically, we have the whole process automated, and I am constantly
> restoring the snapshots to test the backups. I do this once a month, at
> least.

Always am a fan of folks who test their restore processes. :)

> This has worked very well for more than 5 years; I have never had any
> issues. The downside is that I have to keep an EC2 instance with its
> volumes just for backups. That costs me around USD 400 per month.

Well, at least one alternative to performing these snapshots would be to
use a tool like pg_basebackup or pgbackrest to perform the backups
instead.  At least with pgbackrest you can run a backup which pushes the
data directly to s3 for storage (rather than having to use EBS volumes
and snapshots..), and it can back up the stats and such from the primary
and then use a replica to grab the rest, to avoid putting load on the
primary's I/O.  All of this without any of the PG systems involved
having to be shut down, and it can handle doing the restores and writing
the recovery.conf or creating the appropriate signal files for you,
depending on the version of PG that you're running (in other words,
pgbackrest handles those version differences for you).

Thanks,

Stephen
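P.S. In case it helps the PG 14 planning: a pgbackrest setup with an s3
repository mostly boils down to a config along these lines.  This is
only a sketch -- the stanza name, bucket, region, host and paths here
are invented, and the option names should be checked against the
pgbackrest version you actually deploy:

    [global]
    repo1-type=s3
    repo1-s3-bucket=my-pg-backups
    repo1-s3-region=us-east-1
    repo1-s3-endpoint=s3.us-east-1.amazonaws.com
    repo1-path=/pgbackrest

    [main]
    pg1-path=/pgsql/14/main
    # pull the bulk of the backup from a standby instead of the primary
    backup-standby=y
    pg2-host=standby.example.com
    pg2-path=/pgsql/14/main

plus, in postgresql.conf on the primary:

    archive_mode = on
    archive_command = 'pgbackrest --stanza=main archive-push %p'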