On 02/03/2017 09:53 AM, 许雪寒 wrote:
> Thanks for your quick reply :-)
> I'm trying to send you more logs. Many of our online clusters have
> been running the hammer version for a long time; it's a bit difficult
> for us to update those clusters since we are really afraid of
> encountering problems during updating.
The benefits of running a recent version will outweigh the bugs you're
being exposed to by running on 0.94.5.
From your log below, I see that my initial presumption about which
monitor is the expected leader was wrong. It is rg3-ceph36 that's the
expected leader, and I overlooked that fact in my previous email.
It seems that, for some reason, rg3-ceph36 is not participating in the
election with rg3-ceph40:
2017-02-03 16:34:12.259244 7f4f301ca700 10 mon.rg3-ceph40@1(leader).log v98791 logging 2017-02-03 16:33:51.065135 mon.1 10.205.198.85:6789/0 7120 : cluster [INF] mon.rg3-ceph40@1 won leader election with quorum 1,2
Also, soon after that last election (see timestamp, election is won at
16:33:51), mon.2 (which is rg3-ceph45) is calling for a new election:
2017-02-03 16:34:12.661159 7f4f301ca700 5 mon.rg3-ceph40@1(leader).elector(7024) handle_propose from mon.2
2017-02-03 16:34:12.661162 7f4f301ca700 10 mon.rg3-ceph40@1(leader).elector(7024) handle_propose required features 18416819765248, peer features 37154696925806591
2017-02-03 16:34:12.661165 7f4f301ca700 10 mon.rg3-ceph40@1(leader).elector(7024) bump_epoch 7024 to 7025
2017-02-03 16:34:12.673954 7f4f301ca700 10 mon.rg3-ceph40@1(leader) e1 join_election
2017-02-03 16:34:12.673965 7f4f301ca700 10 mon.rg3-ceph40@1(leader) e1 _reset
The reason why this election was called could only be found in
rg3-ceph45's log, but I'm guessing this was an expired lease or something.
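If you want to dig into that, grepping rg3-ceph45's monitor log around those timestamps for election and lease activity should show what triggered it; a rough sketch, assuming the default log location:

  grep -E 'elector|election|lease' /var/log/ceph/ceph-mon.rg3-ceph45.log | grep '2017-02-03 16:3'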
A few seconds later, the current election expires without an elected leader:
2017-02-03 16:34:17.674611 7f4f309cb700 5 mon.rg3-ceph40@1(electing).elector(7025) election timer expired
2017-02-03 16:34:17.674634 7f4f309cb700 5 mon.rg3-ceph40@1(electing).elector(7025) start -- can i be leader?
2017-02-03 16:34:17.674853 7f4f309cb700 1 mon.rg3-ceph40@1(electing).elector(7025) init, last seen epoch 7025
and then mon.2 (rg3-ceph45) triggers a new election:
2017-02-03 16:34:17.700527 7f4f301ca700 5 mon.rg3-ceph40@1(electing).elector(7025) handle_propose from mon.2
2017-02-03 16:34:17.700817 7f4f301ca700 10 mon.rg3-ceph40@1(electing).elector(7025) handle_propose required features 18416819765248, peer features 37154696925806591
2017-02-03 16:34:17.700829 7f4f301ca700 10 mon.rg3-ceph40@1(electing).elector(7025) bump_epoch 7025 to 7027
2017-02-03 16:34:17.735332 7f4f301ca700 10 mon.rg3-ceph40@1(electing) e1 join_election
2017-02-03 16:34:17.735348 7f4f301ca700 10 mon.rg3-ceph40@1(electing) e1 _reset
These two things may arguably have happened at the same time, but
we (rg3-ceph40) may have ended up handling the election proposal from
mon.2 *after* resetting the timer.
This would indicate that mon.2 is not handling our proposals, nor
deferring to us.
Eventually mon.2 defers,
2017-02-03 16:34:24.117851 7f4f301ca700 5 mon.rg3-ceph40@1(electing).elector(7027) handle_ack from mon.2
2017-02-03 16:34:24.117856 7f4f301ca700 5 mon.rg3-ceph40@1(electing).elector(7027) so far i have {1=37154696925806591,2=37154696925806591}
2017-02-03 16:34:27.736027 7f4f309cb700 5 mon.rg3-ceph40@1(electing).elector(7027) election timer expired
2017-02-03 16:34:27.736058 7f4f309cb700 10 mon.rg3-ceph40@1(electing).elector(7027) bump_epoch 7027 to 7028
2017-02-03 16:34:27.776912 7f4f309cb700 10 mon.rg3-ceph40@1(electing) e1 join_election
and we win the election
2017-02-03 16:34:27.777010 7f4f309cb700 10 mon.rg3-ceph40@1(electing) e1 win_election epoch 7028 quorum 1,2 features 37154696925806591
Only to time out while recovering a few seconds later:
2017-02-03 16:34:37.787796 7f4f309cb700 1 mon.rg3-ceph40@1(leader).paxos(paxos recovering c 200550..201290) collect timeout, calling fresh election
2017-02-03 16:34:37.787843 7f4f309cb700 10 mon.rg3-ceph40@1(probing).paxos(paxos recovering c 200550..201290) restart -- canceling timeouts
which seems to indicate that mon.2 is not replying to the recovery
messages, or that we are not getting to them in time.
In any case, there seems to be some weirdness going on with your
network. If you can rule that out, and this may just be a case of
overloaded monitors not being able to get to the messages in time,
you can try increasing the following timeouts:
* mon_lease (default: 5 seconds)
* mon_lease_ack_timeout (default: 10 seconds)
* mon_accept_timeout (default: 10 seconds)
If you choose to increase the timeouts, try applying a factor to all
three of them (e.g., multiply them all by 2).
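For instance, doubling all three in ceph.conf on the monitors would be a minimal sketch like this (the option names are the ones above; the values are just the defaults multiplied by 2, adjust to taste):

  [mon]
      mon lease = 10
      mon lease ack timeout = 20
      mon accept timeout = 20

You should be able to inject these at runtime with 'ceph tell mon.<id> injectargs' as well, but restarting the monitors with the new ceph.conf is the safer way to make sure they stick.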
However, I have a feeling that this may not be the issue.
After all, you do have three monitors, and rg3-ceph36 is nowhere to be
seen in these election rounds.
You should check if rg3-ceph40 is able to properly communicate with both
rg3-ceph45 and rg3-ceph36; same goes for rg3-ceph36 and rg3-ceph45.
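A quick sanity check would be something like the following from each monitor host (a rough sketch; the hostnames are the ones from your logs, and 6789 is the default monitor port, adjust if yours differ):

  # from rg3-ceph40, for example
  ping -c 3 rg3-ceph36
  ping -c 3 rg3-ceph45
  # and make sure the monitor TCP port is actually reachable
  nc -zv rg3-ceph36 6789
  nc -zv rg3-ceph45 6789

and the equivalent from rg3-ceph36 and rg3-ceph45 towards the other two.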
And after checking a few of the patches that have made their way into
more recent hammer versions, I would also recommend you check if you
have available space on your monitor data directories; see
d4e4d85c005ff68cb0461e9d7f0cbe5e6cf2f073
mon/Elector: do a trivial write on every election cycle
which was backported to 0.94.6.
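Checking free space on the monitor stores is cheap; assuming the default data path (/var/lib/ceph/mon/ceph-<id>), something along these lines on each monitor host:

  df -h /var/lib/ceph/mon/ceph-rg3-ceph40
  # size of the monitor's leveldb store itself
  du -sh /var/lib/ceph/mon/ceph-rg3-ceph40/store.db

If that partition is running close to full, it could well be related to the election weirdness.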
Again, should this problem persist, you should definitely set 'debug ms
= 1' and check what's going on.
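You can either add 'debug ms = 1' under [mon] in ceph.conf and restart, or try bumping it at runtime; a sketch, assuming the monitor names above and that the cluster is reachable enough to accept the command:

  ceph tell mon.rg3-ceph40 injectargs '--debug_ms 1'
  ceph tell mon.rg3-ceph45 injectargs '--debug_ms 1'
  ceph tell mon.rg3-ceph36 injectargs '--debug_ms 1'

If the quorum is too flaky for 'ceph tell' to get through, the local admin socket on each monitor host ('ceph daemon mon.<id> config set debug_ms 1') should do the trick as well.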
-Joao