Re: Slave promotion problem...

On 2015-08-31 16:05, marin@xxxxxxxx wrote:
On 2015-08-31 14:38, Martín Marqués wrote:
On 31/08/15 at 03:29, marin@xxxxxxxx wrote:
Last week we had some problems on the master server which caused a
failover on the slave (the master was completely unresponsive due to
reasons still unknown). The slave received the promote signal (pg_ctl
promote) and logged that in the logs:
2015-08-28 23:05:10 UTC [6]: [50-1] user=,db= LOG:  received promote request
2015-08-28 23:05:10 UTC [467]: [2-1] user=,db= FATAL:  terminating walreceiver process due to administrator command

Five hours later the slave still had not been promoted. Meanwhile we
fixed the master and restarted it. The slave was then restarted, and it
behaved as if the promote signal had never arrived, connecting to the
master as a regular slave.

Aren't there any further logs after the walreceiver termination?
Up to here everything looks fine, but we have no idea what was logged
afterwards.
There are logs (quite a few, approximately 5 hours' worth), with entries
like this every second:
2015-08-28 23:05:12 UTC [79867]: [1-1] user=[unknown],db=[unknown] LOG:  connection received: host=[local]
2015-08-28 23:05:12 UTC [79867]: [2-1] user=postgres,db=postgres LOG:  connection authorized: user=postgres database=postgres
These entries log the connections of the process that probes whether the
server is alive.

I was expecting to see something like:
redo done at xxxxx
last completed transaction was at log time xxxxxxx

But those lines had not appeared after 5 hours. As I understand it, they
are written before the server uses the restore_command to get WAL and
history files from the archive.
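
For anyone who wants to automate the check described above, here is a minimal sketch that scans a server log for a promote request that was never followed by a "redo done at" line. The function name promotion_stalled is hypothetical, and the sketch assumes each log entry is on a single line (not wrapped as in this mail) and that the message texts match the ones quoted in this thread.

```python
import re

# Message fragments taken from the log excerpts quoted in this thread.
PROMOTE = re.compile(r"received promote request")
REDO_DONE = re.compile(r"redo done at")


def promotion_stalled(log_lines):
    """Return True if a promote request was logged with no later 'redo done at' line.

    Hypothetical helper: it assumes one log entry per line.
    """
    waiting = False
    for line in log_lines:
        if PROMOTE.search(line):
            waiting = True      # promotion requested, recovery not yet finished
        elif REDO_DONE.search(line):
            waiting = False     # recovery finished after the request
    return waiting
```

Feeding it the lines of the slave's log file would report whether the server was still stuck between the promote request and the end of redo.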

After careful testing of two scenarios:
1. The master is alive (or at least the slave thinks so) and the slave is signaled to promote.
2. The master is dead and the slave is signaled to promote.
We came to the conclusion that in scenario 1 the slave requires an additional WAL file from the archive before the "redo done at xxxxxx" message is logged. Knowing that, we established that the problem was in the restore_command script (a wrapper around several rsync commands), which hung indefinitely.
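
One way to keep a restore_command wrapper from hanging forever is to put a hard time limit on each external command it runs. The following is only a sketch under assumptions, not the script used here: the archive location, the 60-second limit, and the function name run_with_timeout are all hypothetical, and it relies on PostgreSQL passing the WAL file name (%f) and the destination path (%p) as the two arguments.

```python
import subprocess
import sys

ARCHIVE = "archivehost:/wal_archive"  # hypothetical archive location
TIMEOUT_S = 60                        # hypothetical per-command limit, in seconds


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return 0 on success, 1 on failure or if it exceeds timeout_s."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return 1
    return 0 if result.returncode == 0 else 1


# Assumed invocation: restore_command = 'python3 restore_wal.py %f %p'
if __name__ == "__main__" and len(sys.argv) == 3:
    wal_name, dest_path = sys.argv[1], sys.argv[2]
    rsync_cmd = ["rsync", "-a", f"{ARCHIVE}/{wal_name}", dest_path]
    sys.exit(run_with_timeout(rsync_cmd, TIMEOUT_S))
```

With a limit like this, a stuck rsync turns into an ordinary non-zero exit instead of blocking recovery indefinitely. Note that the timeout only kills the direct child process, so a wrapper that spawns further helpers would need additional care.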



I am unsure if this promotion failure is a bug/glitch, but the promote
procedure is automated and tested a couple of hundred times so I am
certain we initiated the promote correctly.

Are you using homemade scripts? Maybe you need to test them more
thoroughly, with different environment parameters.

We use a custom script for the restore_command, but it seems that it
was not invoked.

Regards,
Mladen Marinović



--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


