In addition, we monitor the amount of WALs generated per minute. We did not observe any peaks within this time frame. The rate stays within the normal 30-40 (per minute) range. In contrast, when we repack some huge tables, the write rate goes up to 400+ WALs per minute and then Remote Warehouse lags behind.
So for some reason, the master database (DB1) is stalling all the WAL senders (1 in fact, as only DB2 is directly connected to it) while it executes something. The question is what is this thing? What might cause such behaviour?
На пт, 4.08.2023 г. в 0:15 ч. Kouber Saparev <kouber@xxxxxxxxx> написа:
Hello,We are using PostgreSQL 15.3 with 5 servers configured for streaming replication:
DB1 = master databaseDB2 = replica of DB1Warehouse 1 = replica of DB2 (cascaded)Warehouse 2 = replica of DB2 (cascaded)Remote Warehouse = replica of DB2 (cascaded)Every now and then we experience replication lag in 2 cases:1. Heavy writing on DB1, in which case Remote Warehouse is lagging the most, since the network bandwidth to it is limited;2. Slow query executing on one of the replicas, in which case the WAL replaying is paused until the query finishes.Yet, in one day we had twice a strange situation when all of a sudden all the replicas started lagging at the same time. It lasted for about 30 minutes when it auto-fixed itself. The simultaneity excluded case #2 cited above, and the fact that the last replica to recover was actually not the usual Remote Warehouse, but Warehouse 2, excluded also case #1, as if there was so heavy writing, then the last to recover would be the furthest topologically.
https://ibb.co/hWmhjkL
In the screenshot above the green line is the replica delay of Warehouse 2 (the last to recover with the maximal lag), while the red one - Remote Warehouse (the first to recover with minimal lag).Any ideas why something like that would happen? What should I be looking for in the logs? What is the theoretical reason for this?Kind regards,--Kouber Saparev