Hello,
We are using PostgreSQL 15.3 with 5 servers configured for streaming replication:
DB1 = master database
DB1 = master database
DB2 = replica of DB1
Warehouse 1 = replica of DB2 (cascaded)
Warehouse 2 = replica of DB2 (cascaded)
Remote Warehouse = replica of DB2 (cascaded)
Every now and then we experience replication lag in 2 cases:
1. Heavy writing on DB1, in which case Remote Warehouse is lagging the most, since the network bandwidth to it is limited;
2. Slow query executing on one of the replicas, in which case the WAL replaying is paused until the query finishes.
Yet, in one day we had twice a strange situation when all of a sudden all the replicas started lagging at the same time. It lasted for about 30 minutes when it auto-fixed itself. The simultaneity excluded case #2 cited above, and the fact that the last replica to recover was actually not the usual Remote Warehouse, but Warehouse 2, excluded also case #1, as if there was so heavy writing, then the last to recover would be the furthest topologically.
https://ibb.co/hWmhjkL
In the screenshot above the green line is the replica delay of Warehouse 2 (the last to recover with the maximal lag), while the red one - Remote Warehouse (the first to recover with minimal lag).
https://ibb.co/hWmhjkL
In the screenshot above the green line is the replica delay of Warehouse 2 (the last to recover with the maximal lag), while the red one - Remote Warehouse (the first to recover with minimal lag).
Any ideas why something like that would happen? What should I be looking for in the logs? What is the theoretical reason for this?
Kind regards,
--
Kouber Saparev