On 02/03/2017 09:53 AM, 许雪寒 wrote:
> Thanks for your quick reply :-)
> I'm trying to send you more logs. Many of our online clusters have
> been running the hammer version for a long time; it's a bit difficult
> for us to update those clusters since we are really afraid of
> encountering problems during updating.
The benefits of running a recent version will outweigh the bugs you're
being exposed to by running on 0.94.5.
From your log below, I see that my initial presumption about which
monitor is the expected leader was wrong. It is rg3-ceph36 that's the
expected leader, and I overlooked that fact in my previous email.
It seems that, for some reason, rg3-ceph36 is not participating in the
election with rg3-ceph40:
2017-02-03 16:34:12.259244 7f4f301ca700 10 mon.rg3-ceph40@1(leader).log v98791 logging 2017-02-03 16:33:51.065135 mon.1 10.205.198.85:6789/0 7120 : cluster [INF] mon.rg3-ceph40@1 won leader election with quorum 1,2
Also, soon after that last election (see timestamp, election is won at
16:33:51), mon.2 (which is rg3-ceph45) is calling for a new election:
2017-02-03 16:34:12.661159 7f4f301ca700 5 mon.rg3-ceph40@1(leader).elector(7024) handle_propose from mon.2
2017-02-03 16:34:12.661162 7f4f301ca700 10 mon.rg3-ceph40@1(leader).elector(7024) handle_propose required features 18416819765248, peer features 37154696925806591
2017-02-03 16:34:12.661165 7f4f301ca700 10 mon.rg3-ceph40@1(leader).elector(7024) bump_epoch 7024 to 7025
2017-02-03 16:34:12.673954 7f4f301ca700 10 mon.rg3-ceph40@1(leader) e1 join_election
2017-02-03 16:34:12.673965 7f4f301ca700 10 mon.rg3-ceph40@1(leader) e1 _reset
The reason why this election was called could only be found in
rg3-ceph45's log, but I'm guessing this was an expired lease or something.
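If you want to dig into that, grepping rg3-ceph45's monitor log around those timestamps for election and lease activity should show what triggered it; a rough sketch, assuming the default log location:

  grep -E 'elector|election|lease' /var/log/ceph/ceph-mon.rg3-ceph45.log | grep '2017-02-03 16:3'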
A few seconds later, the current election expires without an elected leader:
2017-02-03 16:34:17.674611 7f4f309cb700 5 mon.rg3-ceph40@1(electing).elector(7025) election timer expired
2017-02-03 16:34:17.674634 7f4f309cb700 5 mon.rg3-ceph40@1(electing).elector(7025) start -- can i be leader?
2017-02-03 16:34:17.674853 7f4f309cb700 1 mon.rg3-ceph40@1(electing).elector(7025) init, last seen epoch 7025
and then mon.2 (rg3-ceph45) triggers a new election:
2017-02-03 16:34:17.700527 7f4f301ca700 5 mon.rg3-ceph40@1(electing).elector(7025) handle_propose from mon.2
2017-02-03 16:34:17.700817 7f4f301ca700 10 mon.rg3-ceph40@1(electing).elector(7025) handle_propose required features 18416819765248, peer features 37154696925806591
2017-02-03 16:34:17.700829 7f4f301ca700 10 mon.rg3-ceph40@1(electing).elector(7025) bump_epoch 7025 to 7027
2017-02-03 16:34:17.735332 7f4f301ca700 10 mon.rg3-ceph40@1(electing) e1 join_election
2017-02-03 16:34:17.735348 7f4f301ca700 10 mon.rg3-ceph40@1(electing) e1 _reset
These two things may arguably have happened at the same time, but
we (rg3-ceph40) may have ended up handling the election proposal from
mon.2 *after* resetting the timer.
This would indicate that mon.2 is not handling our proposals, nor
deferring to us.
Eventually mon.2 defers,
2017-02-03 16:34:24.117851 7f4f301ca700 5 mon.rg3-ceph40@1(electing).elector(7027) handle_ack from mon.2
2017-02-03 16:34:24.117856 7f4f301ca700 5 mon.rg3-ceph40@1(electing).elector(7027) so far i have {1=37154696925806591,2=37154696925806591}
2017-02-03 16:34:27.736027 7f4f309cb700 5 mon.rg3-ceph40@1(electing).elector(7027) election timer expired
2017-02-03 16:34:27.736058 7f4f309cb700 10 mon.rg3-ceph40@1(electing).elector(7027) bump_epoch 7027 to 7028
2017-02-03 16:34:27.776912 7f4f309cb700 10 mon.rg3-ceph40@1(electing) e1 join_election
and we win the election
2017-02-03 16:34:27.777010 7f4f309cb700 10 mon.rg3-ceph40@1(electing) e1 win_election epoch 7028 quorum 1,2 features 37154696925806591
Only to time out while recovering a few seconds later:
2017-02-03 16:34:37.787796 7f4f309cb700 1 mon.rg3-ceph40@1(leader).paxos(paxos recovering c 200550..201290) collect timeout, calling fresh election
2017-02-03 16:34:37.787843 7f4f309cb700 10 mon.rg3-ceph40@1(probing).paxos(paxos recovering c 200550..201290) restart -- canceling timeouts
which seems to indicate that mon.2 is not replying to the recovery
messages, or that we are not getting to them in time.
In any case, there seems to be some weirdness going on with your
network. If you can rule that out, and this may just be a case of
overloaded monitors not being able to get to the messages in time,
you can try increasing the following timeouts:
* mon_lease (default: 5 seconds)
* mon_lease_ack_timeout (default: 10 seconds)
* mon_accept_timeout (default: 10 seconds)
If you choose to increase the timeouts, try applying a factor to all
three of them (e.g., multiply them all by 2).
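For instance, doubling all three in ceph.conf on the monitors would be a minimal sketch like this (the option names are the ones above; the values are just the defaults multiplied by 2, adjust to taste):

  [mon]
      mon lease = 10
      mon lease ack timeout = 20
      mon accept timeout = 20

You should be able to inject these at runtime with 'ceph tell mon.<id> injectargs' as well, but restarting the monitors with the new ceph.conf is the safer way to make sure they stick.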
However, I have a feeling that this may not be the issue.
After all, you do have three monitors, and rg3-ceph36 is nowhere to be
seen in these election rounds.
You should check if rg3-ceph40 is able to properly communicate with both
rg3-ceph45 and rg3-ceph36; same goes for rg3-ceph36 and rg3-ceph45.
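A quick sanity check would be something like the following from each monitor host (a rough sketch; the hostnames are the ones from your logs, and 6789 is the default monitor port, adjust if yours differ):

  # from rg3-ceph40, for example
  ping -c 3 rg3-ceph36
  ping -c 3 rg3-ceph45
  # and make sure the monitor TCP port is actually reachable
  nc -zv rg3-ceph36 6789
  nc -zv rg3-ceph45 6789

and the equivalent from rg3-ceph36 and rg3-ceph45 towards the other two.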
And after checking a few of the patches that have made their way into
more recent hammer versions, I would also recommend you check if you
have available space on your monitor data directories; see
d4e4d85c005ff68cb0461e9d7f0cbe5e6cf2f073
mon/Elector: do a trivial write on every election cycle
which was backported to 0.94.6.
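Checking free space on the monitor stores is cheap; assuming the default data path (/var/lib/ceph/mon/ceph-<id>), something along these lines on each monitor host:

  df -h /var/lib/ceph/mon/ceph-rg3-ceph40
  # size of the monitor's leveldb store itself
  du -sh /var/lib/ceph/mon/ceph-rg3-ceph40/store.db

If that partition is running close to full, it could well be related to the election weirdness.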
Again, should this problem persist, you should definitely set 'debug ms
= 1' and check what's going on.
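You can either add 'debug ms = 1' under [mon] in ceph.conf and restart, or try bumping it at runtime; a sketch, assuming the monitor names above and that the cluster is reachable enough to accept the command:

  ceph tell mon.rg3-ceph40 injectargs '--debug_ms 1'
  ceph tell mon.rg3-ceph45 injectargs '--debug_ms 1'
  ceph tell mon.rg3-ceph36 injectargs '--debug_ms 1'

If the quorum is too flaky for 'ceph tell' to get through, the local admin socket on each monitor host ('ceph daemon mon.<id> config set debug_ms 1') should do the trick as well.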
-Joao