Hello,
I'm encountering a repeating problem with logical replication.
The issue began occurring right after upgrading the cluster from 10.x to
13.x.
There are two databases - `mc` and `comm` - within the same cluster. Few
tables are replicated logically from `mc` to `comm`. The problem is that
the replication gradually slows down (causing WAL storage to be
extended), eventually leading to seemingly complete stop.
The WALSender process is caught in some kind of loop and I'm unable to
either pg_cancel_backend or pg_terminate_backend it.
After some time, PostgreSQL notices that and tries to spawn another
WALSender process, unfortunately the slot is already taken so this fails
too.
The only way to restart the replication is to either restart the whole
cluster or to force exit(1) on the walsender process by using gdb for
example.
After such restart the replications starts at full speed. If there's a
lot of data to process, then I may have to restart it more than once, as
the problem with slowing down and stopping repeats.
At current flow the issue occurs every few hours. There are no log
entries as to why this my occur. The replication is not locked by other
queries/transactions.
I've captured the strace of walsender right before the 'hang':
https://gist.github.com/lpiob/09088d9b0d9becebe3a213fe6f15b5ed
And also a gdb backtrace:
(gdb) bt
#0 0x00005623348d3ce0 in hash_seq_search ()
#1 0x000056233473a396 in ReorderBufferQueueChange ()
#2 0x000056233472fb80 in LogicalDecodingProcessRecord ()
#3 0x00005623347542e3 in ?? ()
#4 0x0000562334756992 in ?? ()
#5 0x00005623347576ff in exec_replication_command ()
#6 0x000056233479c9c5 in PostgresMain ()
#7 0x000056233446d6f2 in ?? ()
#8 0x000056233471ec5e in PostmasterMain ()
#9 0x000056233446e91a in main ()
(gdb)
The issue I'm experiencing look awfully similar to following one:
https://www.postgresql.org/message-id/6fa054d8-ad14-42a2-8926-5d79c97ecd65@xxxxxxxxxxxxxxxxxxxxx
Although in my case, there are no schema changes.
I've tried dumping and restoring the database on separate host, setting
up the replication from scratch. I've ruled out hardware issues or
depleted disk resources. Raising the wal_receiver_timeout or
wal_sender_timeout has no effect - it seems that the setting is
completely ignored.
Please advise on how to debug the issue further or how to resolve it.
I'm completely at loss as to the reasons why it may be happening.
--
Lukasz Biegaj | Unity Group | https://www.unitygroup.com/
System Architect, AWS Solutions Architect