On Wed, May 24, 2017 at 2:15 PM John Spray <jspray@xxxxxxxxxx> wrote:
>
> On Wed, May 24, 2017 at 9:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > On Wed, May 24, 2017 at 3:20 AM John Spray <jspray@xxxxxxxxxx> wrote:
> >>
> >> On Wed, May 24, 2017 at 6:12 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >> > On Tue, May 23, 2017 at 10:31 AM John Spray <jspray@xxxxxxxxxx> wrote:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I could use some help from people who understand the mon better than I
> >> >> do with this ticket: http://tracker.ceph.com/issues/19706
> >> >>
> >> >> The issue is that MDSMonitor is incorrectly killing MDSs because it
> >> >> hasn't seen beacon messages, but the beacon messages are actually just
> >> >> held up because is_readable = 0, like this:
> >> >> 2017-05-23 13:34:20.054785 7f772f1c2700 10
> >> >> mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
> >> >> mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
> >> >> 172.21.15.77:6809/2700711429 con 0x7f77428d8f00
> >> >> 2017-05-23 13:34:20.054788 7f772f1c2700 5 mon.b@0(leader).paxos(paxos
> >> >> recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
> >> >> lease_expire=0.000000 has v0 lc 293
> >> >> 2017-05-23 13:34:20.054791 7f772f1c2700 10
> >> >> mon.b@0(leader).paxosservice(mdsmap 1..11) waiting for paxos ->
> >> >> readable (v9)
> >> >>
> >> >> This appears to be happening when one or more mons are a bit laggy,
> >> >> but it is happening before an election has happened.
> >> >>
> >> >> We have code for handling slow elections by checking how long it has
> >> >> been since the last tick, and resetting our timeout information for
> >> >> MDS beacons if it has been too long
> >> >> (https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)
> >> >>
> >> >>
> >> >> However, in this case the tick() function is getting called
> >> >> throughout, we're just not seeing the beacons because they're held up
> >> >> waiting for readable.
> >> >
> >> >
> >> > This story doesn't seem quite right/complete. This code is only invoked
> >> > if the monitor is the leader, which certainly doesn't happen when the
> >> > monitor is out of quorum. Is the beacon maybe going to a peon which
> >> > isn't forwarding quickly enough?
> >>
> >> That was what I thought (the messages are indeed being forwarded and
> >> getting held up a bit there), but then I looked at the leader's log
> >> and they were getting held up there too.
> >
> >
> > Okay, so it's not that one peon is laggy but that the monitors are
> > knowingly not in a quorum, including the one who is leader on both
> > sides of the election (but not while things are laggy!). And then when
> > the election happens, the leader does a tick and notices it hasn't
> > gotten a beacon from any MDSes in the past timeout interval.
>
> That's what I thought too, until I noticed that the elections were
> only happening *after* we've had our tick() and mistakenly killed an
> MDS. There's this period where the beacons are getting ignored, but
> our tick() is still getting called.
>
> I've snipped out the timeline in an attachment to this mail (easier to
> read in a nice widescreen text editor than most mail clients).

Okay, now I get it.
1) leader is not readable because there's a proposal pending
(presumably waiting on some slow peon)
2) leader receives forwarded beacons, holds them until readable
3) leader tick()s, sees it hasn't processed a beacon within the
timeout, and kills the MDS
4) peon finally times out, so an election is called

So, yeah, I don't think we have any good patterns established there.
You could presumably narrow the race by not doing the eviction
processing while not readable, but I don't have any particular reason
to think that closes it. You could perhaps try to track unprocessed
beacons, and not evaluate the timeout until you've processed them all?

(FYI: attachments won't go through vger at all; I imagine you got a
bounce back from it. Use pastebin or whatever.)
-Greg
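
A rough sketch of the two ideas above (skip eviction while not readable; don't judge an MDS whose beacons are still queued). This is a toy model, not Ceph code: BeaconTracker, on_beacon, on_readable, tick and the times-as-plain-seconds convention are all made up for illustration, and it assumes a queued beacon is credited with its arrival time once paxos becomes readable again.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical model of the leader's beacon bookkeeping (not the real
// MDSMonitor). Times are plain seconds to keep the example small.
struct BeaconTracker {
  double grace;                               // beacon timeout, in seconds
  bool readable = true;                       // stands in for paxos is_readable
  std::map<std::string, double> last_beacon;  // MDS -> time last beacon was processed
  std::map<std::string, double> pending;      // MDS -> newest beacon queued while unreadable

  explicit BeaconTracker(double grace_secs) : grace(grace_secs) {}

  // A beacon arrives at `now`. While unreadable it is only queued,
  // which is step 2 of the race described above.
  void on_beacon(const std::string& mds, double now) {
    if (readable) {
      last_beacon[mds] = now;
    } else {
      pending[mds] = now;  // later arrivals overwrite earlier ones
    }
  }

  // Paxos is readable again: credit queued beacons with their arrival time.
  void on_readable() {
    readable = true;
    for (const auto& kv : pending) {
      auto it = last_beacon.find(kv.first);
      if (it == last_beacon.end() || it->second < kv.second) {
        last_beacon[kv.first] = kv.second;
      }
    }
    pending.clear();
  }

  // Returns the MDSs that would be evicted at `now`.
  //   skip_when_unreadable: do no eviction processing while not readable.
  //   skip_if_pending:      don't judge an MDS that has an unprocessed beacon.
  std::vector<std::string> tick(double now, bool skip_when_unreadable,
                                bool skip_if_pending) const {
    std::vector<std::string> evict;
    if (skip_when_unreadable && !readable) {
      return evict;
    }
    for (const auto& kv : last_beacon) {
      if (skip_if_pending && pending.count(kv.first) > 0) {
        continue;
      }
      if (now - kv.second > grace) {
        evict.push_back(kv.first);
      }
    }
    return evict;
  }
};
```

With a 15 second grace, a beacon processed at t=0, and further beacons queued at t=10 and t=20 while unreadable, tick(20.0, false, false) reports the MDS for eviction, while turning on either guard reports nothing. Neither guard says anything about beacons still in flight from a peon, so this only narrows the window in the toy model too.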