Re: Any better way to ensure WAL continuity over failovers ?

harinath kanchu <kanchuharinath@xxxxxxxxx> · Tue, 11 Mar 2025 15:41:12 -0700

Hello Mohit,

> Do you have any status of what happened later you did that change?

We are currently making connection from the standbys to primary for
fetching the last archived WAL file info, and it's working as
expected, we never observed any gaps in the WAL files in the archive
storage after this change.
As the primary is the only one that uploads the WAL files, we are
within the API limits of the archive storage.

But I felt making connections every time is a big bottleneck and
sometimes you have to use superuser to get the last few connections
reserved, in case the user is bombarding the database with a lot of
connections already.

If the info is given to standby continuously without connections, it
will be great, and one way I thought of is using the "Wal Sender Keep
Alive" messages. I am yet to test this in production, but want to
check the feedback of the community before doing so.

Thanks,

Best,
Harinath

On Tue, Mar 11, 2025 at 10:15 AM Mohit Mishra
<mohitmishra786687@xxxxxxxxx> wrote:
>
> Hello Harinath,
>
> Do you have any status of what happened later you did that change?
>
> It will help understand your context better.
>
> Br,
> Mohit
>
> On Tue, 11 Mar 2025, 07:22 harinath kanchu, <kanchuharinath@xxxxxxxxx> wrote:
>>
>> Hello Postgres Admins,
>>
>> We have a PostgreSQL setup with a primary and 2 standby nodes, with
>> synchronous_commit on and sync_standby_names = ANY 1(standby1,
>> standby2).
>>
>>  Assuming we use 'archive_mode = on' on all replicas, in the event of
>> a primary failure and subsequent failover to a standby node, there is
>> a risk of the primary having failed to upload some WALs to archive
>> storage before the failover.
>>
>> On the other hand, setting 'archive_mode = always' ensures WAL
>> retention on standbys but introduces additional inefficiencies - this
>> would likely require trying to upload each WAL segment from each
>> standby, or at the very least making additional calls to check for WAL
>> presence in the archive. This can be costly and may not scale well.
>>
>> Is there a recommended and optimal way to ensure all WAL files are
>> always uploaded, even across failovers?
>>
>> Here's how we ended up addressing this:  we set 'archive_mode =
>> always' on all replicas, but in our archive_command on standby nodes,
>> we connect to the primary server and query `pg_stat_archiver` to
>> determine the last successfully archived WAL file. Based on this
>> information, the we either returns an exit code 0 (if primary confirms
>> WAL was archived) or a non-zero value (primary has not, or fails to
>> respond).
>>
>> This enables us to retain all WALs on the standby that have not so far
>> been archived by the primary, and prevents redundant uploads - but it
>> involves additional communication between standbys and the primary.
>>
>>  My idea for a more robust solution is to include the last
>> successfully archived WAL file information in the keep-alive messages
>> sent from the primary to standby nodes. This would allow standby
>> servers to use this data directly, eliminating the need for extra
>> queries to the primary.
>>
>> If the community finds this approach useful, I would be interested in
>> contributing a patch to implement this improvement. Please let us know
>> your thoughts!
>>
>> Thanks,
>>
>> Best,
>> Harinath
>>
>>