Re: Postgres restore sometimes restores to a point 2 days in the past

Koen De Groote <kdg.dev@xxxxxxxxx> · Fri, 31 Jan 2025 21:10:38 +0100

> What is the complete pg_basebackup command?
The command: pg_basebackup -h <IP> -p <PORT> -U <USERNAME> -D <ABSOLUTE_PATH> -Ft -z -P -v --wal-method=none

So basically the same as the 2nd example here: https://www.postgresql.org/docs/16/app-pgbasebackup.html except for the verbose flag and the wal-method flag.

The wal-method is none for 2 reasons:
1/ Experience teaches that, in the event of storage being on a network, timeouts to write WAL archives tot he network location can cause WAL creation during a basebackup to be considered failed, and that causes the entire basebackup to be considered failed, even if a retry occurs. Any failure during a basebackup will cause postgres to auto-delete it at the very end of pg_basebackup, declaring it "unusable". This is extremely bad in backups that take very long. Better to not include WAL files in the basebackup, and just get them after the fact.
2/ All my WAL files are archived and uploaded to the cloud. So, I can just have them downloaded.

This has worked for months on end, and so has restoring.

> I don't understand the above.

> What is determining that a particular WAL file should be asked for?
The postgres server itself does this. Here's the documentation: https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-RESTORE-COMMAND

And here: https://www.postgresql.org/docs/current/warm-standby.html

In practice, Postgres will see the "standby.signal" file and start asking for WAL files. It will read the database it has and determine what the next WAL filename should be. And then it asks for it. And it will keep asking for these hexadecimal filenames, 1 at a time, for as long as the command or set of commands provided to "restore_command" returns exit code 0. If the process receives any other exit code, it stops recovery, switches timeline, and considers the database to be up and running at the state its in.

It's constantly asking "I want this file now" and the script I have as the restore command will attempt to download it from the cloud. Then it will attempt to unzip it and move it into place. If any of these steps fails, I return exit code 1.

> How active is the primary database you are pulling from?

Very active, plus automated testing to ensure everything is still running, this will generate multiple items per day on its own.

> Available where?

The cloud, as I stated: WAL files get archived and these archived files are then uploaded to the cloud.

See documentation: https://www.postgresql.org/docs/16/runtime-config-wal.html#GUC-ARCHIVE-COMMAND

Regards,
Koen De Groote

On Fri, Jan 31, 2025 at 5:50 PM Adrian Klaver <adrian.klaver@xxxxxxxxxxx> wrote:
On 1/31/25 01:47, Koen De Groote wrote:

Comments in line.

> I'm running postgres 16.6

> 

> My backup strategy is: basebackup and WAL archive. These get uploaded to 

> the cloud.

> 

> The restore is on an isolated machine and is performed daily. It 

> downloads the basebackup, unpacks it, sets a recovery.signal, and a 

> script is provided as restore_command, to download the WAL archives %f 

> and unpack them into %p

> 

What is the complete pg_basebackup command?

> In the script, the final unpacking is simply "gzip -dc %f > %p". The gz 

> files are first checked with "gzip -t".

> 

> If a WAL archive is asked that doesn't exist yet, the script naturally 

> cannot find it, and exits with status code 1. This is the end of the 

> recovery.

I don't understand the above.

What is determining that a particular WAL file should be asked for?

> 

> There are a few tables that are known to receive new entries multiple 

> times per day. However, the state of the recovery showed the latest item 

> to be 2 days in the past. Checking the live DB, there are an expected 

> amount of items since that ID.

How active is the primary database you are pulling from?

> 

> I checked the logs, the last WAL archive that got downloaded is indeed 

> the last one that was available. The one that failed to download on the 

> restore machine, was uploaded to the cloud 8 minutes later, according to 

> the upload logs on the live DB.

Available where?

If that was the last one available how could the subsequent one be a 

failure to download?

> 

> The postgres logs themselves seem perfectly normal. It logs all these 

> WAL recoveries, switches the timeline, and becomes available.

> 

> What could be going wrong? My main issue is that I don't know where to 

> start looking, since nothing in the logs seems abnormal.

> 

> Regards,

> Koen De Groote

-- 

Adrian Klaver

adrian.klaver@xxxxxxxxxxx