Re: Basebackup fails without useful error message

Koen De Groote <kdg.dev@xxxxxxxxx> · Tue, 22 Oct 2024 21:50:24 +0200

Hello David,

I saw the backup fail. The backup logged that it terminated the walsender, and correlating the moment of failure with my storage metrics shows that the storage was under very heavy IOWAIT at that time. This is network-mounted storage.

The backup process continued, but because WAL could not be streamed without error (due to a local issue), the entire backup was marked as failed. In this case, pg_basebackup deletes the backup at the end. As far as I can tell, there's no flag to control this final behavior.

I'll soon be testing a restore from a backup taken without streaming WAL, since the restore I actually perform doesn't use the pg_wal.tar.gz file; it gets the archived WAL instead. At least I think it doesn't need that file, hence the need for testing.
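
For reference, a restore that relies only on archived WAL can be sketched roughly as below. This is a sketch under assumptions, not a tested procedure: it assumes PostgreSQL 12 or later (where an empty recovery.signal file requests archive recovery), that postgresql.conf lives in the data directory, and that the archive is a plain directory reachable with cp. The function name and both paths are hypothetical. Every WAL segment generated between the start and the end of the base backup has to be present in the archive for this to work.

```python
from pathlib import Path


def prepare_archive_recovery(data_dir, archive_dir):
    """Point an unpacked base backup at a WAL archive directory.

    Hypothetical helper: data_dir is the restored data directory,
    archive_dir is wherever archive_command copied the WAL segments.
    """
    data = Path(data_dir)

    # %f is the WAL file name the server asks for, %p is where to put it.
    restore_command = f"cp {archive_dir}/%f %p"

    # Append the setting to postgresql.conf (assumed to be in the data dir).
    with open(data / "postgresql.conf", "a") as conf:
        conf.write(f"restore_command = '{restore_command}'\n")

    # An empty recovery.signal file asks the server to start in
    # archive recovery on the next startup (PostgreSQL 12+).
    (data / "recovery.signal").touch()
    return restore_command
```

Calling prepare_archive_recovery("/restore/pgdata", "/mnt/wal_archive") before starting the server would make it fetch WAL through restore_command rather than from an unpacked pg_wal.tar.gz.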

Regards,
Koen De Groote

On Tue, Oct 22, 2024 at 12:34 AM David G. Johnston <david.g.johnston@xxxxxxxxx> wrote:
On Sunday, October 20, 2024, Koen De Groote <kdg.dev@xxxxxxxxx> wrote:
I'm going to be testing this. If someone could confirm that this is how writing WAL files works, namely that a WAL file is only considered "done" once the archive_command has finished, that would be great.

The archiving of WAL files by the primary does not involve a replication connection of any sort and thus the “WAL sender” settings are not relevant to it; or, here, whether or not you are archiving your WAL is immaterial since you are streaming it as it gets produced.

If you are streaming WAL it seems highly unusual that you’d end up in a situation where the connection goes idle long enough that it gets killed, especially if the backup is still happening.  I’d probably go with performing the backup under a disabled (or extremely large?) timeout though and move on to other things.
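
As an illustration of running the backup with the timeout disabled, here is a minimal sketch. It assumes the timeout in question is wal_sender_timeout, that the server is PostgreSQL 12 or later (where, as far as I know, this setting can be changed per connection), and that libpq's PGOPTIONS environment variable is honored on the replication connections pg_basebackup opens. The function name, host, and user are hypothetical.

```python
import os


def build_basebackup(target_dir, host="db.example.internal", user="replicator"):
    """Return (argv, env) for pg_basebackup with wal_sender_timeout disabled.

    Hypothetical helper; host and user defaults are placeholders.
    """
    env = dict(os.environ)

    # libpq passes PGOPTIONS to the server as per-connection options.
    # A value of 0 disables wal_sender_timeout for these connections only.
    existing = env.get("PGOPTIONS", "")
    env["PGOPTIONS"] = (existing + " -c wal_sender_timeout=0").strip()

    argv = [
        "pg_basebackup",
        "--host", host,
        "--username", user,
        "--pgdata", target_dir,
        "--format", "tar",         # tar output
        "--gzip",                  # compress, giving base.tar.gz and pg_wal.tar.gz
        "--wal-method", "stream",  # stream WAL while the backup runs
    ]
    return argv, env
```

The result can be handed to subprocess.run(argv, env=env, check=True). Setting the value through the environment keeps the server-wide wal_sender_timeout unchanged for regular standbys.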

That isn’t to say I fully understand what actually is happening here…

David J.