Re: Intermittent Issue with WAL Segment Removal in Logical Replication

Tomas Vondra <tomas.vondra@xxxxxxxxxxxxxxxx> · Sat, 30 Dec 2023 01:26:35 +0100

On 12/29/23 22:28, Kaushik Iska wrote:
> I am unfortunately not really familiar with Google Cloud SQL internals
> as well. But we have seen this happen on Amazon RDS as well.
> 

Do you have a reproducer for regular Postgres?

> Could it be possible that we are requesting a future WAL segment, say
> WAL up to X is written and we are asking for X + 1? It could be that the
> error message is misleading.
> 

I don't think that should be possible. The LSN in the START_REPLICATION
comes from the replica, where it's tracked as the last LSN received from
the upstream. So that shouldn't be in the future. And it doesn't seem
to be suspiciously close to a segment boundary either.

In fact, the LSN in the message is 6/5AE67D79, but the "failed" segment
is 000000010000000600000059, which is the *preceding* one. So it can't
be in the future.
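
To make the arithmetic explicit, here is a rough sketch of how an LSN
maps to a WAL segment file name. It assumes the default 16MB segment
size and timeline 1 (neither is confirmed for this instance), and the
helper names are made up for illustration. On a live server,
pg_walfile_name() does this for you.

```python
# Sketch only: assumes the default 16MB wal_segment_size and timeline 1.
# Helper names are hypothetical, not part of any Postgres client library.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(lsn: str) -> int:
    """Convert an LSN written as 'hi/lo' (both hex) into a 64-bit integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(lsn: str, timeline: int = 1,
                  seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the WAL segment file that contains the given LSN."""
    segno = parse_lsn(lsn) // seg_size
    segs_per_xlogid = 0x100000000 // seg_size  # segments per 4GB of WAL
    return "%08X%08X%08X" % (timeline,
                             segno // segs_per_xlogid,
                             segno % segs_per_xlogid)


def segment_offset(lsn: str, seg_size: int = WAL_SEGMENT_SIZE) -> int:
    """Byte offset of the LSN within its segment."""
    return parse_lsn(lsn) % seg_size


if __name__ == "__main__":
    lsn = "6/5AE67D79"
    print(wal_file_name(lsn))        # 00000001000000060000005A
    print(hex(segment_offset(lsn)))  # 0xe67d79
```

Under those assumptions the requested LSN lives in segment ...5A, so the
missing ...59 is one segment behind it, and the offset within the
segment is nowhere near zero.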

> I do not have the information from pg_replication_slots as I have
> terminated the test. I am fairly certain that I can reproduce this
> again. I will gather both the restart_lsn and contents of pg_wal for the
> failed segment. Is there any other information that would help debug
> this further?
> 

Hard to say. The best thing would be to have a reproducer script, ofc.
If that's not possible, the information already requested seems like a
good start.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company