Slave promotion problem...

marin@xxxxxxxx · Mon, 31 Aug 2015 08:29:13 +0200

Last week we had problems on the master server that triggered a 
failover to the slave (the master was completely unresponsive, for 
reasons still unknown). The slave received the promote signal (pg_ctl 
promote) and recorded it in its log:
2015-08-28 23:05:10 UTC [6]: [50-1] user=,db= LOG:  received promote request
2015-08-28 23:05:10 UTC [467]: [2-1] user=,db= FATAL:  terminating walreceiver process due to administrator command

Five hours later the slave still had not been promoted. Meanwhile we 
fixed the master and restarted it. The slave was then restarted and 
behaved as if the promote signal had never arrived, connecting to the 
master as a regular slave.
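
A stuck promotion like this can be caught earlier by polling the server 
after issuing pg_ctl promote. The following is only a minimal sketch, 
not the automation described in this post: the helper names 
(standby_in_recovery, wait_for_promotion), the timeout values, and the 
assumption that psql is on PATH and can connect with default settings 
are all illustrative. It relies on the built-in function 
pg_is_in_recovery(), which returns true while the server is still a 
standby.

```python
import subprocess
import time


def standby_in_recovery(psql_args=("-X", "-At")):
    """Return True while the server still reports itself as a standby.

    Assumption: `psql` is on PATH and connects with default settings.
    """
    out = subprocess.run(
        ["psql", *psql_args, "-c", "SELECT pg_is_in_recovery()"],
        check=True, capture_output=True, text=True,
    ).stdout.strip()
    # With -At, psql prints a bare "t" or "f" for a boolean result.
    return out == "t"


def wait_for_promotion(in_recovery=standby_in_recovery, timeout=60.0,
                       interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until the server leaves recovery.

    Returns True once the server is out of recovery, or False if it is
    still in recovery when the timeout expires.
    """
    deadline = clock() + timeout
    while True:
        if not in_recovery():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Calling wait_for_promotion() right after the promote command and 
alerting on a False result would have flagged the situation above 
within a minute or so rather than hours later.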

Because of maintenance we had to trigger another failover a few days 
later, and this time it succeeded:
2015-08-30 19:40:08 UTC [312]: [2-1] user=,db= LOG:  replication terminated by primary server
2015-08-30 19:40:08 UTC [312]: [3-1] user=,db= DETAIL:  End of WAL reached on timeline 3 at 1AC/4D000090.
2015-08-30 19:40:08 UTC [312]: [4-1] user=,db= FATAL:  could not send end-of-streaming message to primary: no COPY in progress
2015-08-30 19:40:08 UTC [6]: [34-1] user=,db= LOG:  invalid record length at 1AC/4D000090
2015-08-30 19:40:10 UTC [6]: [35-1] user=,db= LOG:  received promote request
2015-08-30 19:40:13 UTC [6]: [36-1] user=,db= LOG:  redo done at 1AC/4D000028
2015-08-30 19:40:13 UTC [6]: [37-1] user=,db= LOG:  last completed transaction was at log time 2015-08-30 19:40:07.18114+00
2015-08-30 19:40:14 UTC [6]: [38-1] user=,db= LOG:  selected new timeline ID: 4
2015-08-30 19:40:14 UTC [6]: [39-1] user=,db= LOG:  restored log file "00000003.history" from archive
2015-08-30 19:40:14 UTC [6]: [40-1] user=,db= LOG:  archive recovery complete
2015-08-30 19:40:14 UTC [6]: [41-1] user=,db= LOG:  MultiXact member wraparound protections are now enabled
2015-08-30 19:40:14 UTC [29303]: [1-1] user=,db= LOG:  autovacuum launcher started
2015-08-30 19:40:14 UTC [1]: [4-1] user=,db= LOG:  database system is ready to accept connections

I am unsure whether this promotion failure is a bug or a glitch, but 
the promote procedure is automated and has been tested a couple of 
hundred times, so I am certain we initiated the promote correctly. 
Searching the internet I have not found anything similar. Does anybody 
know why the slave would not promote after receiving the promote 
signal? Judging by the logs, it looks as if the slave aborted the 
promote process.
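
The difference between the two log excerpts (a promote request that is 
followed by "selected new timeline ID" versus one that is not) can be 
checked mechanically. The sketch below is illustrative only: the 
function name unfinished_promotions is made up, the regular expression 
assumes the log_line_prefix format shown in this post, and it assumes 
each log message sits on a single line in the server log file.

```python
import re

# Matches lines shaped like the ones quoted in this post, e.g.
# "2015-08-28 23:05:10 UTC [6]: [50-1] user=,db= LOG:  received promote request"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \w+) \[\d+\]: \[\d+-\d+\] "
    r"user=[^,]*,db=\S* (?P<level>[A-Z]+):\s+(?P<msg>.*)$"
)


def unfinished_promotions(lines):
    """Return timestamps of promote requests that were never followed by
    a "selected new timeline ID" message before the next promote request
    or the end of the input."""
    unfinished = []
    pending = None  # timestamp of the most recent unresolved request
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # skip lines that do not fit the assumed prefix
        msg = m.group("msg")
        if msg.startswith("received promote request"):
            if pending is not None:
                unfinished.append(pending)
            pending = m.group("ts")
        elif msg.startswith("selected new timeline ID"):
            pending = None  # promotion reached the timeline switch
    if pending is not None:
        unfinished.append(pending)
    return unfinished
```

Run over the slave's log, this would report the request from 
2015-08-28 23:05:10 UTC as unfinished and the one from 2015-08-30 as 
completed.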

Both instances run PostgreSQL 9.4.4 and are connected via streaming 
replication.

Regards,
Mladen Marinović

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)