> On Nov 12, 2018, at 5:41 AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> This Sunday (yesterday) we had an incident caused by the WAL sender terminating (on Friday) after reaching its timeout (5 mins). This left the replication slot retaining WALs until our production primary server ran out of space. (This is not connected with the WAL fill-up of the previous Sunday, nor does it explain why that happened; still in the dark about that one.)

This sounds like there was a network-related issue. Did the WAL receiver time out too, or did it remain "connected"? If the downstream server did not detect the network issue, and thus failed to drop the abandoned connection and reconnect on its own, then this is normal behavior, as the replication slot would not have been active. I'm a bit confused, as I thought you stated before that you checked the replication slots and they were active and moving forward; right?

> - We give you a mechanism to detect failures, we set the default timeout at 60 seconds, and you are responsible to monitor this and act accordingly or write an automated tool to handle such events (to do what???), otherwise set it to 0 but be prepared, in case of permanent problems, to lose availability when you run out of disk space.
>
> So is there any way to restart the WAL sender? Is there any way to tell PostgreSQL to retry after this specified amount of time? Otherwise what is the purpose of the LOG message? (Which is not even an ERROR?) Should a restart of the subscriber or the publisher node remedy this?

wal_sender_timeout and wal_receiver_timeout are timeouts: Postgres will terminate the connection, and the downstream server will reconnect on its own (as long as it terminates its own connection via wal_receiver_timeout). This is very useful when you have an overzealous firewall that drops idle sessions without sending resets, or any other situation that causes a network connection issue.
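As a sketch of the kind of monitoring mentioned above, you can watch how much WAL each slot is retaining from the pg_replication_slots view (function names as of PostgreSQL 10; on 9.x the equivalents are pg_xlog_location_diff / pg_current_xlog_location). What threshold should page someone is up to you:

```sql
-- Show each replication slot, whether a walsender is attached,
-- and how much WAL the slot is forcing the primary to retain.
-- restart_lsn is the oldest WAL position the slot still needs.
SELECT slot_name,
       active,                  -- false: no walsender attached; WAL keeps piling up
       restart_lsn,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```

An inactive slot with a growing retained_wal figure is exactly the condition that filled your disk; alerting on that number would have caught it days earlier.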
Disabling the timeout seems like a really bad idea; the end result would then depend on your TCP/IP stack, which can take a day or so to detect an abandoned connection unless TCP keepalive is enabled. And I would recommend setting up TCP keepalive to detect abandoned sessions, i.e. again a firewall dropping user sessions without a reset (this can happen on a long-running query, since the connection looks idle to the firewall), or a user closing the lid on their laptop and heading home for the day while still logged in. Without the timeouts, their session will continue to run active queries and/or hold on to open transactions until the OS terminates the session or some other timeout is reached.

Did you check to see if you have any long-running queries or open transactions that are holding on to an xmin?
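For reference, server-side TCP keepalive can be configured directly in postgresql.conf; the parameter names are standard, but the values below are only illustrative, not a recommendation:

```
# postgresql.conf -- TCP keepalive for client connections
# (a value of 0 means "use the operating system default")
tcp_keepalives_idle = 60        # seconds of idleness before the first probe
tcp_keepalives_interval = 10    # seconds between unanswered probes
tcp_keepalives_count = 6        # unanswered probes before the connection is declared dead
```

With settings like these, an abandoned connection is detected in roughly idle + interval * count seconds (about two minutes here) instead of whatever the OS default is, which on Linux is over two hours.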
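To answer that last question, a query along these lines against pg_stat_activity (columns as of PostgreSQL 9.4+) will show which backends are holding back the xmin horizon:

```sql
-- Sessions with open transactions; backend_xmin identifies the
-- backends preventing vacuum from reclaiming old row versions.
SELECT pid, state, backend_xmin,
       now() - xact_start  AS xact_age,
       now() - query_start AS query_age,
       left(query, 60)     AS query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY xact_start;
```

Anything "idle in transaction" with a large xact_age is the usual culprit.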