Aleš Zelený <zeleny.ales@xxxxxxxxx> writes: > Hello, > > we are suing logical replication on 10.4 and it now hangs. After > some timeout it is retarted again, replaying 18GB of data and then > hangs (while 7GB of wals remains to be proceeded). Timeout... Have a look at the 2 setting wal sender/receiver timeout and you probably need to raise the sender timeout value. HTH > > The backlog of 18GB comes from a failed migration adding new table to > replication while replication user was not granted to select the > table. This was clear from log files and once resolved by adding > select privilege, I thought that all will work as usual (same > happened in test env. many times and adding missing grant for select > was sufficient to get it working... these were issues on tests). > > RDBMS Version: > PostgreSQL 10.4 (Ubuntu 10.4-2.pgdg16.04+1) on x86_64-pc-linux-gnu, > compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609, > 64-bit > > Publication (master) instance error message from log: > > 2018-11-02 18:34:14 UTC 7974 5bdc8d27.1f26 7 192.168.23.11(58884) > master_prod repusr 0 5/0 sub_eur idle [00000]:LOG: terminating > walsender process due to replication timeout > > Subscription instance log: > > 2018-11-02 18:34:14 UTC 8657 5bdc8d26.21d1 2 0 3/0 > [XX000]:ERROR: could not receive data from WAL stream: SSL > connection has been closed unexpectedly > 2018-11-02 18:34:14 UTC 24699 5b923b1c.607b 1209397 0 > [00000]:LOG: worker process: logical replication worker for > subscription 37932 (PID 8657) exited with exit code 1 > > While it seems to be a network issue, it is may be not - we have > checked the network and even monitoring, all the time some packets > were exchanged. > > We do have 3 subscriptions(thus repl. slots) for one publication, one > subscriber instance is within same datacenter as master, remainig > subscribers are remote. > > >>> select * from pg_replication_slots > +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+ > | slot_name | plugin | slot_type | datoid | database | > temporary | active | active_pid | xmin | catalog_xmin | restart_lsn > | confirmed_flush_lsn | > +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+ > | sub_usd | pgoutput | logical | 16421 | master_prod | > False | True | 16604 | | 5536488 | 426/AAE55A68 > | 426/AAE55A68 | > | sub_cad | pgoutput | logical | 16421 | master_prod | > False | True | 22875 | | 5536488 | 426/AAE55A68 > | 426/AAE55A68 | > | sub_eur | pgoutput | logical | 16421 | master_prod | > False | True | 16605 | | 5536488 | 426/AAE55A68 > | 426/AAE55A68 | > +------------+----------+-----------+--------+-------------+-----------+--------+------------+------+--------------+--------------+---------------------+ > > > Once after the error occurs, wal senders are re/started and they > reads wal files until they reach restart_lsn wal file: > >>> select pg_walfile_name('426/AAE55A68') > +--------------------------+ > | pg_walfile_name | > +--------------------------+ > | 0000000100000426000000AA | > +--------------------------+ > > # changing file names until they reach this one: > root@master-db:/pgsql/pgcluster/10/master_prod# lsof -p 1560 -p 5758 > -p 5790| grep pg_wal > postgres 1560 postgres 10r REG 259,3 16777216 115766007 / > pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069 > postgres 5758 postgres 10r REG 259,3 16777216 115766007 / > pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069 > postgres 5790 postgres 10r REG 259,3 16777216 115766007 / > pgsql/pgcluster/10/master_prod/pg_wal/000000010000042900000069 > > > At this moment sent_lsn stops growing and nothing is happening for a > while. > > select * from pg_stat_replication; > > pid | usesysid | usename | application_name | client_addr | > client_hostname | client_port | backend_start | > backend_xmin | state | sent_lsn | write_lsn | flush_lsn | > replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | > sync_state > -------+----------+---------+--------------------+---------------+-----------------+-------------+-------------------------------+--------------+---------+--------------+--------------+--------------+--------------+-----------+-----------+------------+---------------+------------ > 16604 | 37868 | repusr | sub_usd | 192.168.21.11 | > | 35010 | 2018-11-02 23:52:22.059157+00 | > | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 > | 426/AAE55A68 | | | | 0 | > async > 16605 | 37868 | repusr | sub_eur | 192.168.23.11 | > | 36388 | 2018-11-02 23:52:24.75038+00 | > | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 > | 426/AAE55A68 | | | | 0 | > async > 12387 | 37868 | repusr | sub_cad | 192.168.22.11 | > | 51982 | 2018-11-02 23:30:31.372404+00 | > | catchup | 429/69E9CC60 | 426/AAE55A68 | 426/AAE55A68 > | 426/AAE55A68 | | | | 0 | > async > (3 rows) > > What I found being done behind the scenes while from > pg_stat_replication "nothing is happening": > > On Master wal snaders strace look like: > > open("pg_replslot/sub_usd/xid-6040508-lsn-429-57000000.snap", > O_WRONLY|O_CREAT|O_APPEND, 0600) = 15 > write(15, "\334\0\0\0\0\0\0\0H\3\234W)\4\0\0\0\0\0\0\0\0\0\0\177\6\0\ > 0%@\0\0"..., 220) = 220 > close(15) = 0 > > On subscription instance, strice for wal reciever looks like this: > recvfrom(7, 0x55f56a3783c3, 5, 0, NULL, NULL) = -1 EAGAIN (Resource > temporarily unavailable) > epoll_create1(EPOLL_CLOEXEC) = 10 > epoll_ctl(10, EPOLL_CTL_ADD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32= > 1781282352, u64=94512536630832}}) = 0 > epoll_ctl(10, EPOLL_CTL_ADD, 9, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32= > 1781282376, u64=94512536630856}}) = 0 > epoll_ctl(10, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32= > 1781282400, u64=94512536630880}}) = 0 > epoll_wait(10, [], 1, 1000) = 0 > > tcpdump: > 21:33:16.811590 IP (tos 0x0, ttl 64, id 24478, offset 0, flags [DF], > proto TCP (6), length 52) > 192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xe2ab > (correct), ack 1360, win 235, options [nop,nop,TS val 3699603 ecr > 2060668460], length 0 > 21:33:26.821860 IP (tos 0x0, ttl 64, id 64522, offset 0, flags [DF], > proto TCP (6), length 120) > 192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum > 0x3ecf (incorrect -> 0xbaf0), seq 1360:1428, ack 1, win 265, options > [nop,nop,TS val 2060670963 ecr 3699603], length 68 > 21:33:26.822156 IP (tos 0x0, ttl 64, id 24479, offset 0, flags [DF], > proto TCP (6), length 52) > 192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xceda > (correct), ack 1428, win 235, options [nop,nop,TS val 3702105 ecr > 2060670963], length 0 > 21:33:36.832445 IP (tos 0x0, ttl 64, id 64523, offset 0, flags [DF], > proto TCP (6), length 120) > 192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum > 0x3ecf (incorrect -> 0x3881), seq 1428:1496, ack 1, win 265, options > [nop,nop,TS val 2060673465 ecr 3702105], length 68 > 21:33:36.832752 IP (tos 0x0, ttl 64, id 24480, offset 0, flags [DF], > proto TCP (6), length 52) > 192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xbb09 > (correct), ack 1496, win 235, options [nop,nop,TS val 3704608 ecr > 2060673465], length 0 > 21:33:46.843063 IP (tos 0x0, ttl 64, id 64524, offset 0, flags [DF], > proto TCP (6), length 120) > 192.168.21.11.34744 > 192.168.20.11.5432: Flags [P.], cksum > 0x3ecf (incorrect -> 0xcf34), seq 1496:1564, ack 1, win 265, options > [nop,nop,TS val 2060675968 ecr 3704608], length 68 > 21:33:46.843362 IP (tos 0x0, ttl 64, id 24481, offset 0, flags [DF], > proto TCP (6), length 52) > 192.168.20.11.5432 > 192.168.21.11.34744: Flags [.], cksum 0xa737 > (correct), ack 1564, win 235, options [nop,nop,TS val 3707111 ecr > 2060675968], length 0 > > Some checksum issues, but finally ack-ed. Tried to open psql session > from host running subscription instance to master instance and it > worked like a charm. > > > I've tried pg_waldump, but nothing what I'll be able to consider as > error... >>> select pg_walfile_name('426/AAE55A68') > +--------------------------+ > | pg_walfile_name | > +--------------------------+ > | 0000000100000426000000AA | > +--------------------------+ > /usr/lib/postgresql/10/bin/pg_waldump -s 426/AAE55A68 > 0000000100000426000000AA | less > > rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: > 426/AAE55A68, prev 426/AAE559C0, desc: RUNNING_XACTS nextXid 5536488 > latestCompletedXid 5536487 oldestRunningXid 5536488 > rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: > 426/AAE55AA0, prev 426/AAE55A68, desc: RUNNING_XACTS nextXid 5536488 > latestCompletedXid 5536487 oldestRunningXid 5536488 > rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: > 426/AAE55AD8, prev 426/AAE55AA0, desc: RUNNING_XACTS nextXid 5536488 > latestCompletedXid 5536487 oldestRunningXid 5536488 > rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: > 426/AAE55B10, prev 426/AAE55AD8, desc: RUNNING_XACTS nextXid 5536488 > latestCompletedXid 5536487 oldestRunningXid 5536488 > rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: > 426/AAE55B48, prev 426/AAE55B10, desc: RUNNING_XACTS nextXid 5536488 > latestCompletedXid 5536487 oldestRunningXid 5536488 > rmgr: Standby len (rec/tot): 50/ 50, tx: 0, lsn: > 426/AAE55B80, prev 426/AAE55B48, desc: RUNNING_XACTS nextXid 5536488 > latestCompletedXid 5536487 oldestRunningXid 5536488 > rmgr: XLOG len (rec/tot): 24/ 24, tx: 0, lsn: > 426/AAE55BB8, prev 426/AAE55B80, desc: SWITCH > > > I've also did the same for sent_lsn: >>> select pg_walfile_name('429/69E9CC60'); > +--------------------------+ > | pg_walfile_name | > +--------------------------+ > | 000000010000042900000069 | > +--------------------------+ > > /usr/lib/postgresql/10/bin/pg_waldump -s 429/69E9CC60 > 000000010000042900000069 | less > rmgr: Heap len (rec/tot): 76/ 76, tx: 5536495, lsn: > 429/69E9CC60, prev 429/69E9CC10, desc: DELETE off 81 KEYS_UPDATED , > blkref #0: rel 1663/16421/38572 blk 40 > rmgr: Heap len (rec/tot): 76/ 76, tx: 5536495, lsn: > 429/69E9CCB0, prev 429/69E9CC60, desc: DELETE off 82 KEYS_UPDATED , > blkref #0: rel 1663/16421/38572 blk 40 > rmgr: Heap len (rec/tot): 76/ 76, tx: 5536495, lsn: > 429/69E9CD00, prev 429/69E9CCB0, desc: DELETE off 83 KEYS_UPDATED , > blkref #0: rel 1663/16421/38572 blk 40 > > The relation 38572 from list above is an ordinary non replicated > table. > > Any help or advice how to diagnose/get it working is highly > appreciated. > > Kind regards Ales Zeleny > > -- Jerry Sievers Postgres DBA/Development Consulting e: postgres.consulting@xxxxxxxxxxx p: 312.241.7800