On Tue, 18 Aug 2020 13:48:41 +0300
Олег Самойлов <splarv@xxxxx> wrote:

> Hi all.
>
> I found some strange behaviour of postgres, which I recognise as a bug. First
> of all, let me explain situation.
>
> I created a "test bed" (not sure how to call it right), to test high
> availability clusters based on Pacemaker and PostgreSQL. The test bed consist
> of 12 virtual machines (on VirtualBox) runing on a MacBook Pro and formed 4
> HA clusters with different structure. And all 4 HA cluster constantly tested
> in loop: simulated failures with different nature, waited for rising
> fall-over, fixing, and so on. For simplicity I'll explain only one HA
> cluster.
> This is 3 virtual machines, with master on one, and sync and async
> slaves on other. The PostgreSQL service is provided by float IPs pointed to
> working master and slaves. Slaves are connected to the master float IP too.
> When the pacemaker detects a failure, for instance, on the master, it promote
> a master on other node with lowest latency WAL and switches float IPs, so the
> third node keeping be a sync slave. My company decided to open this project
> as an open source, now I am finishing formality.

As the maintainer of PAF[1], I'm looking forward to discovering it :)
Do not hesitate to ping me off-list as well regarding Pacemaker and
resource agents.

> Almost works fine, but sometimes, rather rare, I detected that a slave don't
> reconnect to the new master after a failure. First case is PostgreSQL-STOP,
> when I `kill` by STOP signal postgres on the master to simulate freeze. The
> slave don't reconnect to the new master with errors in log:
>
> 18:02:56.236 [3154] FATAL: terminating walreceiver due to timeout
> 18:02:56.237 [1421] LOG: record with incorrect prev-link 0/1600DDE8 at
> 0/1A00DE10

Do you have more logs from both sides of the replication?
How do you build your standbys?

> What is strange that error about incorrect WAL is risen after the
> termination of connection.

This is because the first message comes from the walreceiver itself (3154),
which receives and writes WAL, and the other one comes from the startup
process (1421), which waits for and replays WAL.

> Well, this can be workarouned by turning off wal
> receiver timeout. Now PostgreSQL-STOP works fine, but the problem is still
> exists with other test. ForkBomb simulates an out of memory situation. In
> this case a slave sometimes don't reconnect to the new master too, with
> errors in log:
>
> 10:09:43.99 [1417] FATAL: could not receive data from WAL stream: server
> closed the connection unexpectedly This probably means the server terminated
> abnormally before or while processing the request.
> 10:09:43.992 [1413] LOG: invalid record length at 0/D8014278: wanted 24, got
> 0

I suspect the problem is somewhere else. The first message here is probably
related to your primary being fenced; the second one is normal. Once your IP
has moved to the recently promoted primary, your standbys are supposed to
reconnect with no problem.

> The last error message (last row in log) was observed different, btw.
>
> What I expect as right behaviour. The PostgreSQL slave must reconnect to the
> master IP (float IP) after the wal_retrieve_retry_interval.

In my own experience with PAF, it just works the way you describe.

Regards,

[1] https://clusterlabs.github.io/PAF/
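
P.S. In case it helps while gathering those logs: a minimal Python sketch
that groups log lines by PID, so that messages from the walreceiver and from
the startup process can be read separately. The regex assumes the
"time [pid] LEVEL: message" layout of the excerpts quoted above; the name
group_by_pid is only illustrative, and a different log_line_prefix would need
a different pattern.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the quoted excerpts:
#   "18:02:56.236 [3154] FATAL: message"
# Other log_line_prefix settings would need a different pattern.
LOG_LINE = re.compile(
    r"^(?P<time>\S+) \[(?P<pid>\d+)\] (?P<level>[A-Z]+):\s*(?P<msg>.*)$"
)


def group_by_pid(lines):
    """Return {pid: [(time, level, message), ...]} for lines that match."""
    grouped = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m:  # silently skip lines in another format (e.g. continuations)
            grouped[int(m["pid"])].append((m["time"], m["level"], m["msg"]))
    return dict(grouped)


# The two lines from the first quoted excerpt, joined back onto one line each.
sample = [
    "18:02:56.236 [3154] FATAL: terminating walreceiver due to timeout",
    "18:02:56.237 [1421] LOG: record with incorrect prev-link 0/1600DDE8 at 0/1A00DE10",
]

for pid, entries in group_by_pid(sample).items():
    for time, level, msg in entries:
        print(pid, time, level, msg)
```

Running it over the full log of each node should make it easier to see which
process reported what, and in which order.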