At Thu, 9 Sep 2021 14:52:25 +0900, Abhishek Bhola <abhishek.bhola@xxxxxxxxxxxxxxx> wrote in
> I have found some questions about the same error, but didn't find any of
> them answering my problem.
>
> The setup is that I have two Postgres11 clusters (A and B) and they are
> making use of publication and subscription features to copy data from A to
> B.
>
> A (source DB- publication) --------------> B (target DB - subscription)
>
> This works fine, but often (not always) when the data volume being inserted
> on a table in node A increases, it gives the following error.
>
> "terminating walsender process due to replication timeout"
>
> The data volume at the moment being entered is about 30K rows per second
> continuously for hours through COPY command.
>
> Earlier the wal_sender_timeout was set to 5 sec and I would see this error
> much often. I then increased it to 1 min and the frequency of this error
> reduced. But I don't want to keep increasing it without understanding what
> is causing it. I looked at the code of walsender.c and know the exact lines
> where it's coming from.
>
> But I am still not clear which parameter is making the sender assume that
> the receiver node is inactive and therefore it should stop the wal_sender.
>
> Can anyone please suggest what changes I should make to remove this error?

Which minor version of PostgreSQL 11 are the servers running? PostgreSQL 11
received the following fix in 11.6, which could be related to the
problem.

https://www.postgresql.org/docs/11/release-11-6.html

> Fix timeout handling in logical replication walreceiver processes
> (Julien Rouhaud)
>
> Erroneous logic prevented wal_receiver_timeout from working in
> logical replication deployments.

The details of the fix are here:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3f60f690fac1bf375b92cf2f8682e8fe8f69098

> Fix timeout handling in logical replication worker
>
> The timestamp tracking the last moment a message is received in a
> logical replication worker was initialized in each loop checking if a
> message was received or not, causing wal_receiver_timeout to be ignored
> in basically any logical replication deployments. This also broke the
> ping sent to the server when reaching half of wal_receiver_timeout.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
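
For illustration only, the kind of bug described in the commit message above can be modelled with a toy receive loop. This is a sketch in Python on a fake integer clock, not the PostgreSQL code; the function name run_receiver and its parameters are hypothetical. It shows why re-initializing the "last message received" timestamp on every loop iteration means the timeout never fires, which may be what happens on the subscriber side before 11.6.

```python
def run_receiver(arrivals, timeout, end_time, reset_each_loop):
    """Simulate a receive loop on a fake integer clock.

    arrivals: set of tick numbers at which a message arrives.
    Returns the tick at which the timeout fires, or None if it never does.
    """
    last_recv = 0
    for now in range(end_time + 1):
        if reset_each_loop:
            # Buggy variant: the timestamp is re-initialized on every
            # iteration, so (now - last_recv) is always 0 and the
            # timeout check below can never succeed.
            last_recv = now
        if now in arrivals:
            # A message arrived: remember when.
            last_recv = now
        elif now - last_recv >= timeout:
            # Nothing heard for `timeout` ticks: give up.
            return now
    return None


# The (hypothetical) sender goes silent after tick 3.
arrivals = {1, 2, 3}
buggy = run_receiver(arrivals, timeout=5, end_time=20, reset_each_loop=True)
fixed = run_receiver(arrivals, timeout=5, end_time=20, reset_each_loop=False)
print("buggy loop timed out at:", buggy)
print("fixed loop timed out at:", fixed)
```

In this model the buggy loop never detects the silent peer, while the fixed loop does so once the timeout has elapsed since the last message. Whether this is what is happening in the reported setup depends on the minor version in use.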