On 1/31/25 01:47, Koen De Groote wrote:
Comments inline.
> I'm running Postgres 16.6.
>
> My backup strategy is a base backup plus WAL archiving. Both get
> uploaded to the cloud.
>
> The restore is performed daily on an isolated machine. It downloads the
> base backup, unpacks it, creates a recovery.signal file, and supplies a
> script as restore_command to download each requested WAL archive (%f)
> and unpack it into %p.
What is the complete pg_basebackup command?
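For example, is it something along these lines, or are you including WAL
with -X stream? (The host, user, and flags below are just a guess at a
typical setup for a base backup with separate WAL archiving, not your
actual command.)

    pg_basebackup -h <primary-host> -U <replication-user> \
        -D /backups/base -Ft -z -X none -P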
> In the script, the final unpacking is simply "gzip -dc %f > %p". The
> .gz files are first checked with "gzip -t".
>
> If a WAL archive is requested that doesn't exist yet, the script
> naturally cannot find it and exits with status code 1. That is the end
> of the recovery.
I don't understand the above.

What determines which particular WAL file gets requested?
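From the description, I am picturing the restore script as roughly the
following (the download command, bucket path, and temp location are my
assumptions, not something you have stated):

    #!/bin/bash
    # Used as: restore_command = '/usr/local/bin/restore_wal.sh %f %p'
    set -u
    WAL_NAME="$1"    # %f: WAL file name requested by recovery
    WAL_DEST="$2"    # %p: path Postgres wants the file copied to
    TMP="/tmp/${WAL_NAME}.gz"

    # 'cloud-download' stands in for whatever CLI actually fetches
    # objects from the bucket; fail if the archive is not there yet.
    cloud-download "wal-archive/${WAL_NAME}.gz" "$TMP" || exit 1

    # Verify the gzip file before unpacking it.
    gzip -t "$TMP" || exit 1

    # Unpack into the location recovery expects.
    gzip -dc "$TMP" > "$WAL_DEST" || exit 1
    rm -f "$TMP"

A nonzero exit from restore_command is what tells Postgres the archive
is exhausted, which is why the missing file ends the recovery.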
> A few tables are known to receive new entries multiple times per day.
> However, the state of the recovery showed the latest item to be 2 days
> in the past. Checking the live DB, there is the expected number of
> items since that ID.
How active is the primary database you are pulling from?
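The reason I ask: a WAL segment is normally only handed to
archive_command once it fills up, or when archive_timeout forces a
segment switch, so on a quiet primary the most recent changes can sit in
a not-yet-archived segment for a long time. On the primary, something
like the following would show how archiving is configured and when the
last segment actually went out (which of these matters depends on your
setup):

    psql -c "SHOW archive_timeout;"
    psql -c "SHOW archive_command;"
    psql -c "SELECT last_archived_wal, last_archived_time,
                    failed_count, last_failed_wal
             FROM pg_stat_archiver;"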
> I checked the logs: the last WAL archive that got downloaded is indeed
> the last one that was available. The one that failed to download on the
> restore machine was uploaded to the cloud 8 minutes later, according to
> the upload logs on the live DB.
Available where?

If that was the last one available, how could the subsequent one fail to
download?
> The Postgres logs themselves seem perfectly normal. They show all these
> WAL files being restored, the timeline switch, and the server becoming
> available.
>
> What could be going wrong? My main issue is that I don't know where to
> start looking, since nothing in the logs seems abnormal.
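One concrete comparison that might help: right after a daily restore
finishes, check how far the restored copy actually got versus the
primary (assuming you can still query the restored instance at that
point):

    # On the restored copy, after recovery completes:
    psql -c "SELECT pg_last_wal_replay_lsn(), pg_last_xact_replay_timestamp();"

    # On the live primary, for comparison:
    psql -c "SELECT pg_current_wal_lsn(), pg_current_wal_insert_lsn();"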
> Regards,
> Koen De Groote
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx