Long running vacuum and logical replication delay

Alex Adriaanse <alex.adriaanse@xxxxxxxxxxxxx> · Thu, 26 Sep 2024 17:32:33 -0500

Hi pgsql-admins,

We observed an instance of growing replication delay (measured using
pg_current_wal_lsn() - confirmed_flush_lsn) for a logical replication
slot used by Debezium that strongly correlated with autovacuum
activity.
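
For anyone reproducing the measurement, here is a minimal sketch of how
the delay figure can be computed from the two LSNs outside the database.
It assumes the usual textual LSN format (two hexadecimal numbers
separated by a slash, the high and low 32 bits of the WAL position).
The function names and the sample LSN values are illustrative only and
are not taken from our system. On the server, pg_wal_lsn_diff() returns
the same byte difference.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL the slot's consumer has not yet confirmed."""
    return parse_lsn(current_wal_lsn) - parse_lsn(confirmed_flush_lsn)


if __name__ == "__main__":
    # Hypothetical values standing in for pg_current_wal_lsn() and
    # pg_replication_slots.confirmed_flush_lsn.
    lag = lag_bytes("16/B374D848", "16/B374D000")
    print(f"lag: {lag} bytes ({lag / 2**30:.6f} GiB)")
```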

Logical replication delay started growing after a burst of deletes,
performed in batches/transactions of 1,000 rows over a span of 40
minutes. These deletes resulted in 68 million dead rows, representing
20% of total rows in the given table, which triggered an autovacuum.
After the deletes finished, logical replication delay continued to
grow from normal write activity as long as autovacuum was running.
During this time retained WAL grew to 68 GiB. When autovacuum completed almost
3 hours later, logical replication delay immediately started going
down, and there was an increase in network traffic to Debezium as
replication started catching up. We did not see CPU or I/O saturation
during the time that logical replication struggled to keep up.
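
To put the numbers above in perspective, here is a rough
back-of-the-envelope sketch. It assumes that every dead row came from
the delete burst, that each batch of 1,000 rows was exactly one
transaction, and that the 20% figure can be taken at face value. The
variable names are illustrative.

```python
# Figures quoted in the description above.
dead_rows = 68_000_000
batch_size = 1_000
delete_minutes = 40
dead_percent = 20

# Assumption: one transaction per batch of 1,000 deleted rows.
transactions = dead_rows // batch_size
delete_seconds = delete_minutes * 60

tx_per_second = transactions / delete_seconds
rows_per_second = dead_rows / delete_seconds

# Assumption: "20% of total rows" means dead rows / total rows.
implied_total_rows = dead_rows * 100 // dead_percent

print(f"delete transactions: {transactions:,}")
print(f"transactions per second: {tx_per_second:.1f}")
print(f"rows deleted per second: {rows_per_second:,.0f}")
print(f"implied total rows: {implied_total_rows:,}")
```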

Would this be an indicator to increase logical_decoding_work_mem? If
not, what symptoms would indicate a need to increase this parameter?

Also, what could cause this correlated behavior? Brainstorming ideas
on potential causes:
- Does the WAL activity generated by vacuums slow down walsender or
  logical decoding enough to cause this issue?
- Do autovacuum and logical replication share resources?
- Does a running vacuum block logical replication altogether for a
  given table?
- Does the presence of dead rows slow down logical replication?
- Could the large number of accumulated WAL files slow down logical
  replication (e.g. due to syscalls taking longer to return)?
- Could causality be reversed: instead of autovacuum causing
  replication delay, did replication delay from the large delete cause
  autovacuum to not finish until the delete had fully replicated?

Thanks,

Alex