Dear supporters,

I'm writing some scripts to implement manual failover. I have two clusters (say p1 and p2), where one is the primary (e.g. p1) and the other is the standby (e.g. p2). The manual failover procedure is straightforward:

1. promote p2
2. wait for `pg_is_ready()` on p2
3. run pg_rewind on p1
4. prepare a recovery.conf on p1
5. start p1

This should end up with the same HA setup, but with the roles switched.

It works fine if I do each step manually. But if I run the steps sequentially in a script, it fails after I have switched roles once and try to switch back.

For example, with a fresh setup (timeline starts at 1), I first switched roles, and that worked: p1 became a standby following p2, which was the primary. Then I switched roles again and an error occurred. The error message looks like this:

< 2018-11-12 04:59:24.547 UTC > LOG: entering standby mode
< 2018-11-12 04:59:24.555 UTC > LOG: redo starts at 0/4000028
< 2018-11-12 04:59:24.566 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:24.566 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed
< 2018-11-12 04:59:24.577 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:24.577 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed
< 2018-11-12 04:59:25.413 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:26.416 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:27.419 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:28.422 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:29.425 UTC > FATAL: the database system is starting up
< 2018-11-12 04:59:29.576 UTC > LOG: started streaming WAL from primary at 0/5000000 on timeline 1
< 2018-11-12 04:59:29.576 UTC > FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000005 has already been removed

The pg_rewind output is as follows:

servers diverged at WAL position 0/5000060 on timeline 1
rewinding from last common checkpoint at 0/4000060 on timeline 1

From the log, it seems that the wrong timeline of divergence is evaluated: it should be timeline 2 rather than 1.

Furthermore, if I add a `sleep` between promoting p2 (steps 1 and 2) and rewinding p1 (step 3), it just works. Hence, I suspect the promoted cluster is not ready to be used as a rewind source right after promotion.

Is there anything I need to wait for before I rewind the old primary against the promoted cluster?

Thank you in advance!

---
magodo
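
P.S. In case it helps to see the sequence concretely, here is a minimal Python sketch of the steps listed above. It is an illustration, not my actual script: the `Node` type, the helper names, the connection user, and the recovery.conf contents are all assumptions, both clusters are assumed to run on the same host so that `pg_ctl` can reach both data directories, and stopping the old primary before pg_rewind is an added step that the list leaves implicit (pg_rewind needs the target cluster shut down cleanly).

```python
import subprocess
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Node:
    """Hypothetical description of one cluster (not from the original setup)."""
    host: str
    port: int
    pgdata: str


def conninfo(node: Node, user: str = "postgres") -> str:
    # Assumption: a superuser named "postgres" and a database named "postgres".
    return f"host={node.host} port={node.port} user={user} dbname=postgres"


def recovery_conf(primary_conninfo: str) -> str:
    # Assumed contents; these are standard recovery.conf settings
    # (PostgreSQL versions before 12), but the real file may differ.
    return (
        "standby_mode = 'on'\n"
        f"primary_conninfo = '{primary_conninfo}'\n"
        "recovery_target_timeline = 'latest'\n"
    )


def switchover(standby: Node, primary: Node, run=subprocess.run, sleep=time.sleep):
    """Run the five steps back to back, with no extra pause in between."""
    # 1. promote the standby
    run(["pg_ctl", "-D", standby.pgdata, "promote"], check=True)

    # 2. wait until the promoted node accepts connections
    while run(["pg_isready", "-h", standby.host, "-p", str(standby.port)]).returncode != 0:
        sleep(1)

    # Added step (assumption): stop the old primary cleanly before rewinding it.
    run(["pg_ctl", "-D", primary.pgdata, "stop", "-m", "fast"], check=True)

    # 3. rewind the old primary against the promoted node
    run(
        [
            "pg_rewind",
            f"--target-pgdata={primary.pgdata}",
            f"--source-server={conninfo(standby)}",
        ],
        check=True,
    )

    # 4. prepare recovery.conf on the old primary
    Path(primary.pgdata, "recovery.conf").write_text(recovery_conf(conninfo(standby)))

    # 5. start the old primary; it should now come up as a standby
    run(["pg_ctl", "-D", primary.pgdata, "start"], check=True)
```

With hypothetical data directories, the first switch would be `switchover(standby=Node("localhost", 5433, "/path/to/p2"), primary=Node("localhost", 5432, "/path/to/p1"))`, and the switch back is the same call with the two nodes swapped. The `run` and `sleep` parameters exist only so the sequence can be exercised without a real PostgreSQL installation.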