LwLocks contention (MultiXactOffsetControlLock/multixact_offset) when running logical replication initial snapshot

Marko Sutic <marko.sutic@xxxxxxxxx> · Mon, 8 Apr 2024 14:48:33 +0200

Hello,
We are currently using a shared PostgreSQL cluster (version 11.18) that supports over ten databases. To alleviate the load on this cluster, we've decided to migrate certain databases to dedicated clusters using the native logical replication feature. This approach has been successfully applied to between 50 and 100 databases without issues. However, we have recently encountered an issue related to LWLocks contention.

The problem happened during the taking of an initial snapshot of a slightly bigger database, approximately 500GB, with a single table accounting for 300GB. Although the database remained operational, its performance degraded significantly for some services. Threads experienced delays of 20-30 seconds per simple execution when waiting for the “LWLock:MultiXactOffsetControlLock” and “multixact_offset” locks, which also blocked other processes. This issue did not happen immediately but after a few hours running initial snapshot creation required for logical replication.

Interestingly, not all databases or queries were impacted. The performance degradation primarily affected specific queries, which I've listed below with anonymized table names for confidentiality:

Database "migrated_db":
Insert Query: INSERT INTO library_books (author_id, genre_id, book_id, publisher, library_id, section_key, content) VALUES ($1, $2, $3, $4, $5, $6, $7);
Select Query: SELECT $2 FROM ONLY "academic_records"."lecture_series" x WHERE "professor_id" = $1 FOR KEY SHARE OF x;

Database "other_db":
Update Query: UPDATE "vehicle_registry" SET "mileage_count" = mileage_count + $1 WHERE "vehicle_id" = $2 RETURNING "mileage_count";

These queries experienced significant increases in execution time and shared buffer reads per call. The "library_books" table swelled from 500KB to nearly 800MB, showing increased bloat and the oldest row age. Noticeable drop in transaction rate was visible for affected services.
Upon discontinuing the replication, the locks were released, and the "library_books" table returned to its original size of 500KB, with performance levels improving correspondingly.

Could you please provide insights on how the initial snapshot for logical replication could be causing these LWLocks contention issues? Furthermore, why are only certain queries affected, including some from non-migrated databases?
Would initiating the snapshot with pg_dump, reducing or temporarily removing the workload on the affected queries, or making certain parameter adjustments help resolve this issue?

Thank you for your assistance and insights.

Best regards,
Marko