> 19 авг. 2020 г., в 16:07, Jehan-Guillaume de Rorthais <jgdr@xxxxxxxxxx> написал(а): > > On Tue, 18 Aug 2020 13:48:41 +0300 > Олег Самойлов <splarv@xxxxx> wrote: > >> Hi all. >> >> I found some strange behaviour of postgres, which I recognise as a bug. First >> of all, let me explain situation. >> >> I created a "test bed" (not sure how to call it right), to test high >> availability clusters based on Pacemaker and PostgreSQL. The test bed consist >> of 12 virtual machines (on VirtualBox) runing on a MacBook Pro and formed 4 >> HA clusters with different structure. And all 4 HA cluster constantly tested >> in loop: simulated failures with different nature, waited for rising >> fall-over, fixing, and so on. For simplicity I'll explain only one HA >> cluster. >> This is 3 virtual machines, with master on one, and sync and async >> slaves on other. The PostgreSQL service is provided by float IPs pointed to >> working master and slaves. Slaves are connected to the master float IP too. >> When the pacemaker detects a failure, for instance, on the master, it promote >> a master on other node with lowest latency WAL and switches float IPs, so the >> third node keeping be a sync slave. My company decided to open this project >> as an open source, now I am finishing formality. > > As the maintainer of PAF[1], I'm looking forward to discover it :) > Do not hesitate to ping me offlist as well in regard with Pacemaker and > resource agents. I am glad to see you here. Great thanks for PAF module. > >> Almost works fine, but sometimes, rather rare, I detected that a slave don't >> reconnect to the new master after a failure. First case is PostgreSQL-STOP, >> when I `kill` by STOP signal postgres on the master to simulate freeze. The >> slave don't reconnect to the new master with errors in log: >> >> 18:02:56.236 [3154] FATAL: terminating walreceiver due to timeout >> 18:02:56.237 [1421] LOG: record with incorrect prev-link 0/1600DDE8 at >> 0/1A00DE10 > > Do you have more logs from both side of the replication? There was nothing special. Just errors from previous tests. > How do you build your standbys? Okey, I'll show you all configs. Linux Centos 7, PostgreSQL 11 from the official postgres repository. # rpm -q corosync corosync-2.4.5-4.el7.x86_64 # rpm -q pacemaker pacemaker-1.1.21-4.el7.x86_64 # rpm -q sbd sbd-1.4.0-15.el7.x86_64 The cluster designed for three datacenter with good connection between in one city , so it must survive a crush of one datacenter. So stonith is not used, instead I use quorum and sbd based watchdog. Name of nodes: tuchanka3a, tuchanka3b, tuchanka3c. Name for float ip: krogan3 for the master; krogan3s1, krogan3s2 for slaves. postgresql common conf (not default options): ident_file = '/var/lib/pgsql/pg_ident.conf' hba_file = '/var/lib/pgsql/pg_hba.conf' listen_addresses = '*' log_filename = 'postgresql.%F.log' wal_keep_segments = 1 restart_after_crash = off wal_receiver_timeout=0 # in case PostgreSQL-STOP wal_receiver_timeout is default 60s, of cause shared_buffers = 32MB max_wal_size=80MB postgresql conf specific for nodes: for tuchanka3a: $ cat krogan3.conf synchronous_commit = remote_write synchronous_standby_names = 'ANY 1 (tuchanka3b,tuchanka3c)' $ cat krogan3.paf primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3a sslmode=disable' recovery_target_timeline = 'latest' standby_mode = 'on' for tucanka3b: $ cat krogan3.conf synchronous_commit = remote_write synchronous_standby_names = 'ANY 1 (tuchanka3a,tuchanka3c)' $ cat krogan3.paf primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3b sslmode=disable' recovery_target_timeline = 'latest' standby_mode = 'on' for tuchanka3c: $ cat krogan3.conf synchronous_commit = remote_write synchronous_standby_names = 'ANY 1 (tuchanka3a,tuchanka3b)' $ cat krogan3.paf primary_conninfo = 'host=krogan3 user=replicant application_name=tuchanka3c sslmode=disable' recovery_target_timeline = 'latest' standby_mode = 'on' pacemaker config, specific for this cluster: pcs cluster cib original_cib.xml cp original_cib.xml configured_cib.xml pcs -f configured_cib.xml resource create krogan3DB ocf:heartbeat:pgsqlms bindir=/usr/pgsql-11/bin pgdata=/var/lib/pgsql/krogan3 recovery_template=/var/lib/pgsql/krogan3.paf op monitor interval=15 timeout=10 monitor interval=16 role=Master timeout=20 monitor interval=17 role=Slave timeout=10 meta master notify=true resource-stickiness=10 pcs -f configured_cib.xml resource create krogan3IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.35 pcs -f configured_cib.xml resource create krogan3s1IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.36 pcs -f configured_cib.xml resource create krogan3s2IP ocf:heartbeat:IPaddr2 nic=eth0 cidr_netmask=24 ip=192.168.89.37 pcs -f configured_cib.xml constraint colocation add krogan3IP with master krogan3DB-master INFINITY pcs -f configured_cib.xml constraint order promote krogan3DB-master then start krogan3IP symmetrical=false pcs -f configured_cib.xml constraint order demote krogan3DB-master then stop krogan3IP symmetrical=false kind=Optional pcs -f configured_cib.xml constraint location krogan3s1IP rule score=-INFINITY master-krogan3DB lt integer 0 pcs -f configured_cib.xml constraint location krogan3s2IP rule score=-INFINITY master-krogan3DB lt integer 0 pcs -f configured_cib.xml constraint colocation add krogan3s1IP with slave krogan3DB-master INFINITY pcs -f configured_cib.xml constraint colocation add krogan3s2IP with slave krogan3DB-master INFINITY pcs -f configured_cib.xml constraint colocation add krogan3s1IP with krogan3s2IP -1000 pcs -f configured_cib.xml constraint order start krogan3DB-master then start krogan3s1IP pcs -f configured_cib.xml constraint order start krogan3DB-master then start krogan3s2IP pcs cluster cib-push configured_cib.xml --wait diff-against=original_cib.xml Test PostgreSQL-STOP (wal_receiver_timeout is default 60s): killall -STOP postgres PostgreSQL logs: >From old master, tuchanka 3a: 2020-08-20 10:03:35.633 MSK [1359] LOG: received promote request 2020-08-20 10:03:35.633 MSK [1507] FATAL: terminating walreceiver process due to administrator command 2020-08-20 10:03:35.634 MSK [1359] LOG: redo is not required 2020-08-20 10:03:35.636 MSK [1359] LOG: selected new timeline ID: 3 2020-08-20 10:03:35.703 MSK [1359] LOG: archive recovery complete 2020-08-20 10:03:35.712 MSK [1357] LOG: database system is ready to accept connections 2020-08-20 10:03:36.704 MSK [1640] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:03:36.818 MSK [1641] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:05:03.399 MSK [1641] LOG: terminating walsender process due to replication timeout 2020-08-20 10:05:31.829 MSK [2564] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:07:07.183 MSK [1640] LOG: terminating walsender process due to replication timeout 2020-08-20 10:08:01.798 MSK [3813] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:09:36.798 MSK [3813] LOG: terminating walsender process due to replication timeout 2020-08-20 10:10:29.449 MSK [4975] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:12:05.088 MSK [2564] LOG: terminating walsender process due to replication timeout 2020-08-20 10:12:28.589 MSK [5977] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:14:15.063 MSK [5977] LOG: terminating walsender process due to replication timeout 2020-08-20 10:14:37.883 MSK [7071] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:16:29.821 MSK [7071] LOG: terminating walsender process due to replication timeout 2020-08-20 10:16:59.105 MSK [8236] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:18:39.647 MSK [4975] LOG: terminating walsender process due to replication timeout 2020-08-20 10:20:27.047 MSK [9855] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:22:13.899 MSK [8236] LOG: terminating walsender process due to replication timeout 2020-08-20 10:22:37.731 MSK [10933] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:24:28.857 MSK [10933] LOG: terminating walsender process due to replication timeout 2020-08-20 10:24:57.420 MSK [12084] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby All errors are from former test, which was applied to slaves. (The host for tests choosen randomly). New master tuchanka3b, some logs are copied from the old master on database restoration after previous test by pg_basebackup. I separated them by blank line. 2020-08-20 10:03:35.633 MSK [1359] LOG: received promote request 2020-08-20 10:03:35.633 MSK [1507] FATAL: terminating walreceiver process due to administrator command 2020-08-20 10:03:35.634 MSK [1359] LOG: redo is not required 2020-08-20 10:03:35.636 MSK [1359] LOG: selected new timeline ID: 3 2020-08-20 10:03:35.703 MSK [1359] LOG: archive recovery complete 2020-08-20 10:03:35.712 MSK [1357] LOG: database system is ready to accept connections 2020-08-20 10:03:36.704 MSK [1640] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:03:36.818 MSK [1641] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:05:03.399 MSK [1641] LOG: terminating walsender process due to replication timeout 2020-08-20 10:05:31.829 MSK [2564] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:07:07.183 MSK [1640] LOG: terminating walsender process due to replication timeout 2020-08-20 10:08:01.798 MSK [3813] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:09:36.798 MSK [3813] LOG: terminating walsender process due to replication timeout 2020-08-20 10:10:29.449 MSK [4975] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:12:05.088 MSK [2564] LOG: terminating walsender process due to replication timeout 2020-08-20 10:12:28.589 MSK [5977] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:14:15.063 MSK [5977] LOG: terminating walsender process due to replication timeout 2020-08-20 10:14:37.883 MSK [7071] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:16:29.821 MSK [7071] LOG: terminating walsender process due to replication timeout 2020-08-20 10:16:59.105 MSK [8236] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:18:39.647 MSK [4975] LOG: terminating walsender process due to replication timeout 2020-08-20 10:20:27.047 MSK [9855] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:22:13.899 MSK [8236] LOG: terminating walsender process due to replication timeout 2020-08-20 10:22:37.731 MSK [10933] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:24:28.857 MSK [10933] LOG: terminating walsender process due to replication timeout 2020-08-20 10:24:55.888 MSK [1407] LOG: database system was interrupted; last known up at 2020-08-20 10:24:49 MSK 2020-08-20 10:24:55.906 MSK [1407] LOG: entering standby mode 2020-08-20 10:24:55.908 MSK [1407] LOG: redo starts at 0/15000028 2020-08-20 10:24:55.909 MSK [1407] LOG: consistent recovery state reached at 0/15002300 2020-08-20 10:24:55.910 MSK [1405] LOG: database system is ready to accept read only connections 2020-08-20 10:24:55.928 MSK [1411] LOG: started streaming WAL from primary at 0/16000000 on timeline 3 2020-08-20 10:26:37.308 MSK [1411] FATAL: terminating walreceiver due to timeout 2020-08-20 10:26:37.308 MSK [1407] LOG: invalid record length at 0/1600C4D8: wanted 24, got 0 2020-08-20 10:30:55.965 MSK [1407] LOG: received promote request 2020-08-20 10:30:55.965 MSK [2424] FATAL: terminating walreceiver process due to administrator command 2020-08-20 10:30:55.966 MSK [1407] LOG: redo done at 0/1600C4B0 2020-08-20 10:30:55.966 MSK [1407] LOG: last completed transaction was at log time 2020-08-20 10:25:38.764299+03 2020-08-20 10:30:55.968 MSK [1407] LOG: selected new timeline ID: 4 2020-08-20 10:30:56.001 MSK [1407] LOG: archive recovery complete 2020-08-20 10:30:56.005 MSK [1405] LOG: database system is ready to accept connections The slave with didn't reconnected replication, tuchanka3c. Also I separated logs copied from the old master by a blank line: 2020-08-20 10:03:35.633 MSK [1359] LOG: received promote request 2020-08-20 10:03:35.633 MSK [1507] FATAL: terminating walreceiver process due to administrator command 2020-08-20 10:03:35.634 MSK [1359] LOG: redo is not required 2020-08-20 10:03:35.636 MSK [1359] LOG: selected new timeline ID: 3 2020-08-20 10:03:35.703 MSK [1359] LOG: archive recovery complete 2020-08-20 10:03:35.712 MSK [1357] LOG: database system is ready to accept connections 2020-08-20 10:03:36.704 MSK [1640] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:03:36.818 MSK [1641] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:05:03.399 MSK [1641] LOG: terminating walsender process due to replication timeout 2020-08-20 10:05:31.829 MSK [2564] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:07:07.183 MSK [1640] LOG: terminating walsender process due to replication timeout 2020-08-20 10:08:01.798 MSK [3813] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:09:36.798 MSK [3813] LOG: terminating walsender process due to replication timeout 2020-08-20 10:10:29.449 MSK [4975] LOG: standby "tuchanka3c" is now a candidate for quorum synchronous standby 2020-08-20 10:12:05.088 MSK [2564] LOG: terminating walsender process due to replication timeout 2020-08-20 10:12:28.589 MSK [5977] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:14:15.063 MSK [5977] LOG: terminating walsender process due to replication timeout 2020-08-20 10:14:37.883 MSK [7071] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:16:29.821 MSK [7071] LOG: terminating walsender process due to replication timeout 2020-08-20 10:16:59.105 MSK [8236] LOG: standby "tuchanka3b" is now a candidate for quorum synchronous standby 2020-08-20 10:18:39.647 MSK [4975] LOG: terminating walsender process due to replication timeout 2020-08-20 10:20:25.168 MSK [1408] LOG: database system was interrupted; last known up at 2020-08-20 10:20:19 MSK 2020-08-20 10:20:25.180 MSK [1408] LOG: entering standby mode 2020-08-20 10:20:25.181 MSK [1408] LOG: redo starts at 0/11000098 2020-08-20 10:20:25.183 MSK [1408] LOG: consistent recovery state reached at 0/11000A68 2020-08-20 10:20:25.183 MSK [1406] LOG: database system is ready to accept read only connections 2020-08-20 10:20:25.193 MSK [1412] LOG: started streaming WAL from primary at 0/12000000 on timeline 3 2020-08-20 10:25:05.370 MSK [1955] LOG: could not send data to client: Connection reset by peer 2020-08-20 10:25:05.370 MSK [1955] CONTEXT: PL/pgSQL function beat4tmux(text) line 4 at RAISE 2020-08-20 10:25:05.370 MSK [1955] STATEMENT: call beat4tmux('3') 2020-08-20 10:25:05.370 MSK [1955] FATAL: connection to client lost 2020-08-20 10:25:05.370 MSK [1955] CONTEXT: PL/pgSQL function beat4tmux(text) line 4 at RAISE 2020-08-20 10:25:05.370 MSK [1955] STATEMENT: call beat4tmux('3') 2020-08-20 10:26:38.655 MSK [1412] FATAL: terminating walreceiver due to timeout 2020-08-20 10:26:38.655 MSK [1408] LOG: record with incorrect prev-link 0/1200C4B0 at 0/1600C4D8 The function beat4tmux used for health monitoring. The output from mon: Online: [ tuchanka3b tuchanka3c ] OFFLINE: [ tuchanka3a ] Full list of resources: Master/Slave Set: krogan3DB-master [krogan3DB] Masters: [ tuchanka3b ] Slaves: [ tuchanka3c ] Stopped: [ tuchanka3a ] krogan3IP (ocf::heartbeat:IPaddr2): Started tuchanka3b krogan3s1IP (ocf::heartbeat:IPaddr2): Stopped krogan3s2IP (ocf::heartbeat:IPaddr2): Stopped Node Attributes: * Node tuchanka3b: + master-krogan3DB : 1001 * Node tuchanka3c: + master-krogan3DB : -1000 Migration Summary: * Node tuchanka3b: * Node tuchanka3c: Pending Fencing Actions: * reboot of tuchanka3a pending: client=crmd.1242, origin=tuchanka3a >> Well, this can be workarouned by turning off wal_receiver_timeout. >> Now PostgreSQL-STOP works fine, but the problem is still >> exists with other test. ForkBomb simulates an out of memory situation. In >> this case a slave sometimes don't reconnect to the new master too, with >> errors in log: >> >> 10:09:43.99 [1417] FATAL: could not receive data from WAL stream: server >> closed the connection unexpectedly This probably means the server terminated >> abnormally before or while processing the request. >> 10:09:43.992 [1413] LOG: invalid record length at 0/D8014278: wanted 24, got >> 0 > > I suspect the problem is somewhere else. The first message here is probably > related to your primary being fenced, the second one is normal. After your > IP moved to the recently promoted primary, your standby are supposed to > reconnect with no problem. I believe that problem with the ForkBomb has the same nature. The case with PostgreSQL-STOP and default wal_receiver_timeout can be easily repeated, it rises always by `killall -STOP postgres` on the master, as I can see. But the case with wal_receiver_timeout=0 and ForkBomb happens rather rare, not always.