fsync
... taking more knowledge around this will shed some light to understand this problem "slave ahead of master"
"there was silence, because master hang."
... replication halted here, master holds the latest copy which is missing at both the slaves
"I decided to promote one of the slaves"
... only 2 slaves are left, and one among them is going to be the master now, in master/slave nomenclature the data from master is considered as the valid one from this point onward
... you mentioned that the original master comes up in the mean time one of the slave was already a master
On Wed, Mar 19, 2014 at 10:40 PM, Evgeniy Shishkin <itparanoia@xxxxxxxxx> wrote:
Hello,
we have 3 servers with postgresql 9.3.3. One is master and two slaves.
We run synchronous_replication and fsync, synchronous_commit and full_page_writes are on.
Suddenly master hang up with hardware failure, it is a strange bug in iLo which we investigate with HP.
Before master was rebooted, i ran ps aux on slave
postgres: wal receiver process streaming 12/F1031DF8
Last messages in slaves logs was
2014-03-19 02:41:29.005 GMT,,,7389,,53108c69.1cdd,16029,,2014-02-28 13:17:29 GMT,,0,LOG,00000,"recovery restart point at 12/DFFBB3E8","last completed transaction was at log time 2014-03-19 02:41:28.886869+00",,,,,,,,""
and then there was silence, because master hang.
Then master was rebooted and slave wrote in log
2014-03-19 15:36:39.176 GMT,,,7392,,53108c69.1ce0,2,,2014-02-28 13:17:29 GMT,,0,FATAL,XX000,"terminating walreceiver due to timeout",,,,,,,,,""
2014-03-19 15:36:39.177 GMT,,,7388,,53108c69.1cdc,6,,2014-02-28 13:17:29 GMT,1/0,0,LOG,00000,"record with zero length at 12/F1031DF8",,,,,,,,,""
2014-03-19 15:36:57.181 GMT,,,12100,,5329b996.2f44,1,,2014-03-19 15:36:54 GMT,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: No route to host
Is the server running on host ""10.162.2.50"" and accepting
TCP/IP connections on port 5432?
",,,,,,,,,""
Then master finally came back, slave wrote
2014-03-19 15:40:09.389 GMT,,,13121,,5329ba59.3341,1,,2014-03-19 15:40:09 GMT,,0,FATAL,XX000,"could not connect to the primary server: FATAL: the database system is starting up
",,,,,,,,,""
2014-03-19 15:40:16.468 GMT,,,13136,,5329ba5e.3350,1,,2014-03-19 15:40:14 GMT,,0,LOG,00000,"started streaming WAL from primary at 12/F1000000 on timeline 1",,,,,,,,,""
2014-03-19 15:40:16.468 GMT,,,13136,,5329ba5e.3350,2,,2014-03-19 15:40:14 GMT,,0,FATAL,XX000,"could not receive data from WAL stream: ERROR: requested starting point 12/F1000000 is ahead of the WAL flush position of this server 12/F0FFFCE8
",,,,,,,,,""
last message was repeated several times
and then this happened
2014-03-19 15:42:04.623 GMT,,,13722,,5329bacc.359a,1,,2014-03-19 15:42:04 GMT,,0,LOG,00000,"started streaming WAL from primary at 12/F1000000 on timeline 1",,,,,,,,,""
2014-03-19 15:42:04.628 GMT,,,7388,,53108c69.1cdc,7,,2014-02-28 13:17:29 GMT,1/0,0,LOG,00000,"invalid record length at 12/F1031DF8",,,,,,,,,""
2014-03-19 15:42:04.628 GMT,,,13722,,5329bacc.359a,2,,2014-03-19 15:42:04 GMT,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,,""
2014-03-19 15:42:09.628 GMT,,,7388,,53108c69.1cdc,8,,2014-02-28 13:17:29 GMT,1/0,0,LOG,00000,"invalid record length at 12/F1031DF8",,,,,,,,,""
2014-03-19 15:42:14.628 GMT,,,7388,,53108c69.1cdc,9,,2014-02-28 13:17:29 GMT,1/0,0,LOG,00000,"invalid record length at 12/F1031DF8",,,,,,,,,""
2014-03-19 15:42:19.628 GMT,,,7388,,53108c69.1cdc,10,,2014-02-28 13:17:29 GMT,1/0,0,LOG,00000,"invalid record length at 12/F1031DF8",,,,,,,,,”"
and it just repeats forever.
Meanwhile on master
2014-03-19 15:39:30.957 GMT,,,7115,,5329ba32.1bcb,2,,2014-03-19 15:39:30 GMT,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,,""
2014-03-19 15:39:30.989 GMT,,,7115,,5329ba32.1bcb,3,,2014-03-19 15:39:30 GMT,,0,LOG,00000,"redo starts at 12/DFFBB3E8",,,,,,,,,""
2014-03-19 15:39:47.114 GMT,,,7115,,5329ba32.1bcb,4,,2014-03-19 15:39:30 GMT,,0,LOG,00000,"redo done at 12/F0FFFC38",,,,,,,,,""
2014-03-19 15:39:47.114 GMT,,,7115,,5329ba32.1bcb,5,,2014-03-19 15:39:30 GMT,,0,LOG,00000,"last completed transaction was at log time 2014-03-19 05:02:29.273138+00",,,,,,,,,""
2014-03-19 15:39:47.115 GMT,,,7115,,5329ba32.1bcb,6,,2014-03-19 15:39:30 GMT,,0,LOG,00000,"checkpoint starting: end-of-recovery immediate",,,,,,,,,""
2014-03-19 15:40:16.466 GMT,"replicator","",7986,"10.162.2.52:44336",5329ba5e.1f32,1,"idle",2014-03-19 15:40:14 GMT,2/0,0,ERROR,XX000,"requested starting point 12/F1000000 is ahead of the WAL flush position of this server 12/F0FFFCE8",,,,,,,,,"walreceiver"
So, all two slaves are disconnected from master, which somehow is past his slaves.
I decided to promote one of the slaves, so we can have some snapshot of the data.
relevant logs from this are
2014-03-19 16:50:43.118 GMT,,,4444,,5329cae3.115c,3,,2014-03-19 16:50:43 GMT,,0,LOG,00000,"redo starts at 12/DFFBB3E8",,,,,,,,,""
2014-03-19 16:50:50.028 GMT,"dboperator","postgres",4452,"[local]",5329caea.1164,1,"",2014-03-19 16:50:50 GMT,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2014-03-19 16:50:51.128 GMT,,,4444,,5329cae3.115c,4,,2014-03-19 16:50:43 GMT,,0,LOG,00000,"invalid contrecord length 5736 at 12/F0FFFC80",,,,,,,,,""
2014-03-19 16:50:51.128 GMT,,,4444,,5329cae3.115c,5,,2014-03-19 16:50:43 GMT,,0,LOG,00000,"redo done at 12/F0FFFC38",,,,,,,,,””
It is interesting that redo done at 12/F0FFFC38 both on master and promoted slave.
The main question is there is actual latest data, and how is it possible that master is behind his slave in synchronous replication.
Thanks for the help.
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance