Search Postgresql Archives

Logical replication gradually slowing down, then hanging.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello,

I'm encountering a repeating problem with logical replication.

The issue began occurring right after upgrading the cluster from 10.x to 13.x.

There are two databases - `mc` and `comm` - within the same cluster. Few tables are replicated logically from `mc` to `comm`. The problem is that the replication gradually slows down (causing WAL storage to be extended), eventually leading to seemingly complete stop.

The WALSender process is caught in some kind of loop and I'm unable to either pg_cancel_backend or pg_terminate_backend it. After some time, PostgreSQL notices that and tries to spawn another WALSender process, unfortunately the slot is already taken so this fails too.

The only way to restart the replication is to either restart the whole cluster or to force exit(1) on the walsender process by using gdb for example.

After such restart the replications starts at full speed. If there's a lot of data to process, then I may have to restart it more than once, as the problem with slowing down and stopping repeats.

At current flow the issue occurs every few hours. There are no log entries as to why this my occur. The replication is not locked by other queries/transactions.

I've captured the strace of walsender right before the 'hang': https://gist.github.com/lpiob/09088d9b0d9becebe3a213fe6f15b5ed

And also a gdb backtrace:
(gdb) bt
#0  0x00005623348d3ce0 in hash_seq_search ()
#1  0x000056233473a396 in ReorderBufferQueueChange ()
#2  0x000056233472fb80 in LogicalDecodingProcessRecord ()
#3  0x00005623347542e3 in ?? ()
#4  0x0000562334756992 in ?? ()
#5  0x00005623347576ff in exec_replication_command ()
#6  0x000056233479c9c5 in PostgresMain ()
#7  0x000056233446d6f2 in ?? ()
#8  0x000056233471ec5e in PostmasterMain ()
#9  0x000056233446e91a in main ()
(gdb)


The issue I'm experiencing look awfully similar to following one:
https://www.postgresql.org/message-id/6fa054d8-ad14-42a2-8926-5d79c97ecd65@xxxxxxxxxxxxxxxxxxxxx

Although in my case, there are no schema changes.

I've tried dumping and restoring the database on separate host, setting up the replication from scratch. I've ruled out hardware issues or depleted disk resources. Raising the wal_receiver_timeout or wal_sender_timeout has no effect - it seems that the setting is completely ignored.

Please advise on how to debug the issue further or how to resolve it. I'm completely at loss as to the reasons why it may be happening.



--
Lukasz Biegaj | Unity Group | https://www.unitygroup.com/
System Architect, AWS Solutions Architect





[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux