Re: PostgreSQL 10.5 : Logical replication timeout results in PANIC in pg_wal "No space left on device"

Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> · Tue, 13 Nov 2018 09:52:07 +0200

On 12/11/18 6:30 μ.μ., Rui DeSousa wrote:

On Nov 12, 2018, at 5:41 AM, Achilleas Mantzios <achill@xxxxxxxxxxxxxxxxxxxxx> wrote:

This Sunday (yesterday) we had an incident caused by wal sender terminating (on Friday) after reaching timeout (5 mins). This left the replication slot retaining wals till our production primary server run out of space. (this is not connected with the wal fill up of the previous Sunday nor does it explain why it happened, still in the dark about this one).
This sounds like there was a network related issue.

yes, apparently, or a transaction was large enough so that it didn't get through within the timeout limit. However I counted about 24 WAL files in this period  so it would be max about 400MB. Which 
means the speed fell down to 1.3MB/sec which is pretty slow for connections between the same cloud provider.

Did the WAL receiver timeout too or did it remain “connected”?

I mention it in the original message :
"By querying pg_stat_subscription all 3 timestamps were about 5 mins prior to the above walsender termination.
I didn't get *any* ERROR/FATAL message on the subscriber side".

Or to be more precise, I didn't get any related message on the subscriber side. Here what the subscriber logs contain for the said time window :

 [106486] 2018-11-09 14:55:16.996 EET  @ line:468 LOG:  checkpoint complete: wrote 17564 buffers (1.7%); 0 WAL file(s) added, 0 removed, 43 recycled; write=610.394 s, sync=0.044 s, total=610.521 s; 
sync files=507, longest=0.011 s, average=0.000 s; distance=564195 kB, estimate=712814 kB
 [106486] 2018-11-09 14:55:19.093 EET  @ line:469 LOG:  checkpoint starting: xlog
10.9.202.211(49220) [18191] 2018-11-09 14:59:35.772 EET [unknown] [unknown]@[unknown] line:1 LOG:  connection received: host=10.9.202.211 port=49220
.......
 [106486] 2018-11-09 15:06:56.846 EET  @ line:470 LOG:  checkpoint complete: wrote 9034 buffers (0.9%); 0 WAL file(s) added, 0 removed, 24 recycled; write=697.701 s, sync=0.003 s, total=697.752 s; 
sync files=414, longest=0.002 s, average=0.000 s; distance=722719 kB, estimate=722719 kB
.......
 [106486] 2018-11-09 15:10:19.880 EET  @ line:471 LOG:  checkpoint starting: time
 [106486] 2018-11-09 15:23:49.049 EET  @ line:472 LOG:  checkpoint complete: wrote 9554 buffers (0.9%); 0 WAL file(s) added, 0 removed, 0 recycled; write=809.158 s, sync=0.003 s, total=809.169 s; 
sync files=453, longest=0.001 s, average=0.000 s; distance=327761 kB, estimate=683223 kB

If the downstream server did not detected the network issue thus failed to drop the abandoned connection and reconnect on its own then this is normal behavior as the replication slot would not have be active.

The subscriber did not detect the walsender termination. It was only after the subscriber node restart (this Sunday) that it started again asking for WAL segments.
The replication slot lives on the publisher node, it was apparently inactive since the walsender terminated on Friday (till the PANIC)

I’m a bit confused and as I thought you stated before that you checked the replication slots and they where active and moving forward; right?

I have sent two separate emails with two separate subjects, reporting two distinct incidents. In this incident, I could not check any replication slot because the primary had PANIC'ed.

- We give you a mechanism to detect failures, we set the default timeout at 60 seconds, and you are responsible to monitor this and act accordingly or write an automated tool to handle such events (to do what???), otherwise set it to 0 but be prepared, in case of permanent problems, to loose availability when you run out of disk space.

So is there any way to restart the WAL sender ? Is there any way to tell postgresql to retry after this specified amount of time? Otherwise what is the purpose of the LOG message? (which is not even an ERROR ?) Should a restart of the subscriber or the publisher node remedy this?
wal_sender_timeout and wal_receiver_timeout are timeout and Postgres will terminate the connect and the downstream server will reconnect on its own (as long as it terminates its own connection — wal_receiver_timeout).

This didn't happen. Nothing replication related coming from the subscriber logs since the walsender termination.

Did you check to see if you have any long running queries or opened transactions that are holding on to a xmin?

https://www.postgresql.org/docs/10/protocol-logical-replication.html
An uncommitted transaction won't make it to the subscriber.

--
Achilleas Mantzios
IT DEV Lead
IT DEPT
Dynacom Tankers Mgmt