pg9.6 when is a promoted cluster ready to accept "rewind" request?

Dear supporters,

I'm writing some scripts to implement manual failover. I have two
clusters (say p1 and p2), where one is the primary (e.g. p1) and the
other is the standby (e.g. p2). The manual failover procedure is
straightforward:

1. promote on p2
2. wait until `pg_isready` succeeds on p2
3. rewind on p1
4. prepare a recovery.conf on p1
5. start p1
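
The five steps above could be sketched as a script along these lines. This is only an illustration: the function names, data directory paths, host, port and connection strings are hypothetical placeholders. It assumes each command runs on the appropriate host, and that p1 has already been shut down cleanly before pg_rewind runs, which pg_rewind requires.

```python
import subprocess
import time
from pathlib import Path


def recovery_conf(primary_conninfo: str) -> str:
    """Text of a 9.6-style recovery.conf that follows the new primary.

    Assumes the conninfo string contains no single quotes.
    """
    return (
        "standby_mode = 'on'\n"
        f"primary_conninfo = '{primary_conninfo}'\n"
        "recovery_target_timeline = 'latest'\n"
    )


def failover_commands(old_data, new_data, new_host, new_port, source_conninfo):
    """Return (step, argv) pairs for the external commands in steps 1, 2, 3 and 5."""
    return [
        ("promote", ["pg_ctl", "promote", "-D", new_data]),
        ("wait", ["pg_isready", "-h", new_host, "-p", str(new_port)]),
        ("rewind", ["pg_rewind", "--target-pgdata", old_data,
                    "--source-server", source_conninfo]),
        ("start", ["pg_ctl", "start", "-w", "-D", old_data]),
    ]


def run_failover(old_data, new_data, new_host, new_port,
                 source_conninfo, primary_conninfo):
    """Run the steps in order; step 4 (recovery.conf) is written just before start."""
    for step, argv in failover_commands(old_data, new_data, new_host,
                                        new_port, source_conninfo):
        if step == "wait":
            # pg_isready exits with 0 once the server accepts connections.
            while subprocess.run(argv).returncode != 0:
                time.sleep(1)
            continue
        if step == "start":
            Path(old_data, "recovery.conf").write_text(recovery_conf(primary_conninfo))
        subprocess.run(argv, check=True)
```

Keeping the command construction separate from the execution makes it possible to print or log exactly what each step would run before trying it against real clusters.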

This should end up with the same HA setup, but with the roles switched.

It works fine if I do each step manually.

But if I run the steps sequentially in a script, it fails after I have
switched roles once and try to switch back.

For example, with a fresh setup (timeline starts from 1), I first
switched roles, and it worked: I got p1 as a standby following p2,
which was the primary. Then I switched roles again and an error
occurred. The log looks like this:

   < 2018-11-12 04:59:24.547 UTC > LOG:  entering standby mode
   < 2018-11-12 04:59:24.555 UTC > LOG:  redo starts at 0/4000028
   < 2018-11-12 04:59:24.566 UTC > LOG:  started streaming WAL from primary at 0/5000000 on timeline 1
   < 2018-11-12 04:59:24.566 UTC > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000020000000000000005 has already been removed

   < 2018-11-12 04:59:24.577 UTC > LOG:  started streaming WAL from primary at 0/5000000 on timeline 1
   < 2018-11-12 04:59:24.577 UTC > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000020000000000000005 has already been removed

   < 2018-11-12 04:59:25.413 UTC > FATAL:  the database system is starting up
   < 2018-11-12 04:59:26.416 UTC > FATAL:  the database system is starting up
   < 2018-11-12 04:59:27.419 UTC > FATAL:  the database system is starting up
   < 2018-11-12 04:59:28.422 UTC > FATAL:  the database system is starting up
   < 2018-11-12 04:59:29.425 UTC > FATAL:  the database system is starting up
   < 2018-11-12 04:59:29.576 UTC > LOG:  started streaming WAL from primary at 0/5000000 on timeline 1
   < 2018-11-12 04:59:29.576 UTC > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000020000000000000005 has already been removed


The pg_rewind output is as follows:

   servers diverged at WAL position 0/5000060 on timeline 1
   rewinding from last common checkpoint at 0/4000060 on timeline 1

From the log, it seems that the wrong timeline of divergence was
computed: it should be timeline 2 rather than 1.
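
To make the timeline mismatch in the log easier to see, here is a small sketch that decodes a WAL segment file name into its timeline and starting WAL position. The function name is hypothetical, and it assumes the default 16 MB WAL segment size.

```python
def parse_wal_segment_name(name: str, segment_size: int = 16 * 1024 * 1024):
    """Split a 24-hex-digit WAL segment name into (timeline, start LSN).

    The name consists of three 8-digit hex fields: the timeline ID, the
    high 32 bits of the WAL position, and the segment number within them.
    """
    if len(name) != 24:
        raise ValueError(f"not a WAL segment file name: {name!r}")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    segment_no = int(name[16:24], 16)
    start = (log_id << 32) + segment_no * segment_size
    # LSNs are conventionally printed as two hex numbers: high/low 32 bits.
    return timeline, f"{start >> 32:X}/{start & 0xFFFFFFFF:X}"


timeline, start_lsn = parse_wal_segment_name("000000020000000000000005")
print(timeline, start_lsn)
```

For the segment in the error message this gives timeline 2 and start position 0/5000000, while the standby reports streaming from 0/5000000 on timeline 1.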

Furthermore, if I add a `sleep` between promoting p2 and step 3
(rewind), it just works.

Hence, I suspect the promoted cluster is not ready to be used as a
rewind source right after promotion. Is there anything I need to wait
for before I rewind the old primary against the promoted cluster?
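
As an illustration of the kind of wait being asked about, the fixed `sleep` could be replaced by polling some condition on the promoted cluster. The sketch below polls the "Latest checkpoint's TimeLineID" line of `pg_controldata` until it reaches the expected timeline. That this is the right condition is an assumption, not something established in this thread; the function names, the data directory argument and the timeout values are hypothetical, and it assumes the script can run `pg_controldata` locally against p2's data directory.

```python
import os
import re
import subprocess
import time


def checkpoint_timeline(controldata_output: str) -> int:
    """Extract the latest checkpoint's timeline from pg_controldata output."""
    match = re.search(r"^Latest checkpoint's TimeLineID:\s*(\d+)\s*$",
                      controldata_output, re.MULTILINE)
    if match is None:
        raise ValueError("TimeLineID line not found in pg_controldata output")
    return int(match.group(1))


def wait_for_timeline(data_dir: str, expected: int,
                      timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll pg_controldata until the checkpoint timeline reaches `expected`.

    Returns True on success, False if the timeout expires first.
    """
    # Force untranslated messages so the label matched above is stable.
    env = {**os.environ, "LC_ALL": "C"}
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = subprocess.run(["pg_controldata", data_dir], check=True,
                                capture_output=True, text=True, env=env)
        if checkpoint_timeline(result.stdout) >= expected:
            return True
        time.sleep(interval)
    return False
```

In the second switch described above, the expected timeline on the newly promoted cluster would be one higher than the timeline it was following as a standby.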

Thank you in advance!

---
magodo





