Logical replication gradually slowing down, then hanging.

Lukasz Biegaj <lukasz.biegaj@xxxxxxxxxxxxxx> · Mon, 16 Nov 2020 10:20:42 +0100

Hello,

I'm encountering a repeating problem with logical replication.

The issue began occurring right after upgrading the cluster from 10.x to 
13.x.

There are two databases - `mc` and `comm` - within the same cluster. Few 
tables are replicated logically from `mc` to `comm`. The problem is that 
the replication gradually slows down (causing WAL storage to be 
extended), eventually leading to seemingly complete stop.

The WALSender process is caught in some kind of loop and I'm unable to 
either pg_cancel_backend or pg_terminate_backend it.
After some time, PostgreSQL notices that and tries to spawn another 
WALSender process, unfortunately the slot is already taken so this fails 
too.

The only way to restart the replication is to either restart the whole 
cluster or to force exit(1) on the walsender process by using gdb for 
example.

After such restart the replications starts at full speed. If there's a 
lot of data to process, then I may have to restart it more than once, as 
the problem with slowing down and stopping repeats.

At current flow the issue occurs every few hours. There are no log 
entries as to why this my occur. The replication is not locked by other 
queries/transactions.

I've captured the strace of walsender right before the 'hang': 
https://gist.github.com/lpiob/09088d9b0d9becebe3a213fe6f15b5ed

And also a gdb backtrace:
(gdb) bt
#0  0x00005623348d3ce0 in hash_seq_search ()
#1  0x000056233473a396 in ReorderBufferQueueChange ()
#2  0x000056233472fb80 in LogicalDecodingProcessRecord ()
#3  0x00005623347542e3 in ?? ()
#4  0x0000562334756992 in ?? ()
#5  0x00005623347576ff in exec_replication_command ()
#6  0x000056233479c9c5 in PostgresMain ()
#7  0x000056233446d6f2 in ?? ()
#8  0x000056233471ec5e in PostmasterMain ()
#9  0x000056233446e91a in main ()
(gdb)

The issue I'm experiencing look awfully similar to following one:
https://www.postgresql.org/message-id/6fa054d8-ad14-42a2-8926-5d79c97ecd65@xxxxxxxxxxxxxxxxxxxxx

Although in my case, there are no schema changes.

I've tried dumping and restoring the database on separate host, setting 
up the replication from scratch. I've ruled out hardware issues or 
depleted disk resources. Raising the wal_receiver_timeout or 
wal_sender_timeout has no effect - it seems that the setting is 
completely ignored.

Please advise on how to debug the issue further or how to resolve it. 
I'm completely at loss as to the reasons why it may be happening.

--
Lukasz Biegaj | Unity Group | https://www.unitygroup.com/
System Architect, AWS Solutions Architect