
BUG? Slave doesn't reconnect to the master


Hi all.

I found some strange behaviour in PostgreSQL that I believe is a bug. First, let me explain the situation.

I created a test bed to test high-availability clusters based on Pacemaker and PostgreSQL. The test bed consists of 12 virtual machines (on VirtualBox) running on a MacBook Pro, forming 4 HA clusters with different structures. All 4 HA clusters are tested constantly in a loop: a failure of some kind is simulated, the test waits for failover to complete, the failure is fixed, and so on. For simplicity I'll describe only one HA cluster. It consists of 3 virtual machines, with the master on one and a sync slave and an async slave on the others. The PostgreSQL service is provided through floating IPs that point to the working master and slaves. The slaves also connect to the master through its floating IP. When Pacemaker detects a failure, for instance on the master, it promotes the node with the lowest WAL latency to master and switches the floating IPs, so the third node continues to be a sync slave. My company has decided to release this project as open source; I am now finishing the formalities.

Almost everything works fine, but occasionally, rather rarely, I see that a slave doesn't reconnect to the new master after a failure. The first case is the PostgreSQL-STOP test, in which I `kill` postgres on the master with the STOP signal to simulate a freeze. The slave doesn't reconnect to the new master, and these errors appear in its log:

18:02:56.236 [3154] FATAL:  terminating walreceiver due to timeout
18:02:56.237 [1421] LOG:  record with incorrect prev-link 0/1600DDE8 at 0/1A00DE10
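
For readers who want to reproduce the PostgreSQL-STOP freeze described above, here is a minimal sketch of the idea in Python. It is not the test bed's actual code: the helper names `freeze`, `thaw` and `demo` are hypothetical, and a sleeping child process stands in for postgres on the master. It only shows that a process that receives SIGSTOP stays alive but stops running, which is what makes the master look frozen to its peers.

```python
import os
import signal
import subprocess
import sys


def freeze(pid: int) -> None:
    """Suspend a process with SIGSTOP: it stays alive but stops running."""
    os.kill(pid, signal.SIGSTOP)


def thaw(pid: int) -> None:
    """Resume a process that was suspended with SIGSTOP."""
    os.kill(pid, signal.SIGCONT)


def demo() -> bool:
    """Freeze a stand-in child process and report whether it was stopped."""
    # Assumption: a sleeping child stands in for postgres on the master.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    try:
        freeze(child.pid)
        # WUNTRACED makes waitpid also report children that are stopped,
        # not only children that have exited.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        stopped = os.WIFSTOPPED(status)
        thaw(child.pid)
        return stopped
    finally:
        # Clean up the stand-in process in every case.
        child.kill()
        child.wait()


if __name__ == "__main__":
    print("child was frozen:", demo())
```

In a real run the PID passed to `freeze` would be the postgres process on the master, and the sketch assumes a POSIX system.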

What is strange is that the error about the incorrect WAL record is raised after the connection has been terminated. This can be worked around by turning off wal_receiver_timeout. With that change PostgreSQL-STOP works fine, but the problem still exists with another test. ForkBomb simulates an out-of-memory situation. In this case too, a slave sometimes doesn't reconnect to the new master, with these errors in the log:

10:09:43.99 [1417] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
                This probably means the server terminated abnormally
                before or while processing the request.
10:09:43.992 [1413] LOG:  invalid record length at 0/D8014278: wanted 24, got 0

Incidentally, the last error message (the last row in the log) varied between occurrences.

What I expect as the correct behaviour: the PostgreSQL slave must reconnect to the master IP (the floating IP) after wal_retrieve_retry_interval.
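
To make the expected behaviour concrete, here is a small sketch of a reconnect loop in Python. It is only an illustration of the expectation stated above, not how PostgreSQL implements its WAL receiver: `stream_with_retry`, `connect_and_stream`, `retry_interval_s` and `max_attempts` are all hypothetical names, and `max_attempts` exists only to keep the sketch finite (a standby would be expected to keep retrying).

```python
import time
from typing import Callable


def stream_with_retry(
    connect_and_stream: Callable[[], None],
    retry_interval_s: float,
    max_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call connect_and_stream until it succeeds, pausing after each failure.

    Returns the number of attempts that were needed. Re-raises the last
    ConnectionError if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            connect_and_stream()
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise
            # Plays the role of wal_retrieve_retry_interval in this sketch.
            sleep(retry_interval_s)
    raise ValueError("max_attempts must be at least 1")


if __name__ == "__main__":
    failures_left = [2]  # hypothetical: the first two attempts fail

    def flaky_connect() -> None:
        if failures_left[0] > 0:
            failures_left[0] -= 1
            raise ConnectionError("server closed the connection unexpectedly")

    print("attempts needed:", stream_with_retry(flaky_connect, 0.1, 5))
```

The point of the sketch is that a lost connection, whatever the last error was, leads to a pause and another attempt against the same floating IP, which by then points at the new master.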




