Re: Handling is_readable=0 periods in mon

Gregory Farnum <gfarnum@xxxxxxxxxx> · Wed, 24 May 2017 14:31:00 -0700

On Wed, May 24, 2017 at 2:15 PM John Spray <jspray@xxxxxxxxxx> wrote:
>
> On Wed, May 24, 2017 at 9:45 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> > On Wed, May 24, 2017 at 3:20 AM John Spray <jspray@xxxxxxxxxx> wrote:
> >>
> >> On Wed, May 24, 2017 at 6:12 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >> > On Tue, May 23, 2017 at 10:31 AM John Spray <jspray@xxxxxxxxxx> wrote:
> >> >>
> >> >> Hi all,
> >> >>
> >> >> I could use some help from people who understand the mon better than I
> >> >> do with this ticket: http://tracker.ceph.com/issues/19706
> >> >>
> >> >> The issue is that MDSMonitor is incorrectly killing MDSs because it
> >> >> hasn't seen beacon messages, but the beacon messages are actually just
> >> >> held up because is_readable = 0, like this:
> >> >> 2017-05-23 13:34:20.054785 7f772f1c2700 10
> >> >> mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
> >> >> mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
> >> >> 172.21.15.77:6809/2700711429 con 0x7f77428d8f00
> >> >> 2017-05-23 13:34:20.054788 7f772f1c2700  5 mon.b@0(leader).paxos(paxos
> >> >> recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
> >> >> lease_expire=0.000000 has v0 lc 293
> >> >> 2017-05-23 13:34:20.054791 7f772f1c2700 10
> >> >> mon.b@0(leader).paxosservice(mdsmap 1..11)  waiting for paxos ->
> >> >> readable (v9)
> >> >>
> >> >> This appears to be happening when one or more mons are a bit laggy,
> >> >> but it is happening before an election has happened.
> >> >>
> >> >> We have code for handling slow elections by checking how long it has
> >> >> been since the last tick, and resetting our timeout information for
> >> >> MDS beacons if it has been too long
> >> >> (https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)
> >> >>
> >> >>
> >> >> However, in this case the tick() function is getting called
> >> >> throughout, we're just not seeing the beacons because they're held up
> >> >> waiting for readable.
> >> >
> >> >
> >> > This story doesn't seem quite right/complete. This code is only invoked
> >> > if the monitor is the leader, which certainly doesn't happen when the
> >> > monitor is out of quorum. Is the beacon maybe going to a peon which
> >> > isn't forwarding quickly enough?
> >>
> >> That was what I thought (the messages are indeed being forwarded and
> >> getting held up a bit there), but then I looked at the leader's log
> >> and they were getting held up there too.
> >
> >
> > Okay, so it's not that one peon is laggy but that the monitors are
> > knowingly not in a quorum, including the one who is leader on both
> > sides of the election (but not while things are laggy!). And then when
> > the election happens, the leader does a tick and notices it hasn't
> > gotten a beacon from any MDSes in the past timeout interval.
>
> That's what I thought too, until I noticed that the elections were
> only happening *after* we've had our tick() and mistakenly killed an
> MDS.  There's this period where the beacons are getting ignored, but
> our tick() is still getting called.
>
> I've snipped out the timeline in an attachment to this mail (easier to
> read in a nice widescreen text editor than most mail clients).

Okay, now I get it.

1) leader is not readable because there's a proposal pending
(presumably, waiting on some slow peon)
2) leader receives forwarded beacons, holds them until readable
3) leader tick()s and says it hasn't processed a beacon in an
acceptable amount of time, so it kills the MDS
4) peon finally times out, so an election is called

So, yeah, I don't think we have any good patterns established there.
You could presumably narrow the race by not doing the eviction
processing while not readable, but I don't have any particular reason
to think that closes it. You could perhaps try to track unprocessed
beacons, and not evaluate eviction for an MDS until you've processed
all of its queued beacons?
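
To make the "track unprocessed beacons" idea concrete, here is a toy
sketch in Python rather than mon C++. Everything in it is made up for
illustration: ToyMDSMonitor, handle_beacon, set_readable, the
track_pending switch, and the 15-second grace value are assumptions,
not the real MDSMonitor code or config. It only models the timeline
above: beacons queue up while not readable, and tick() declines to
evict an MDS that still has a queued beacon.

```python
from collections import deque

BEACON_GRACE = 15.0  # seconds; illustrative value, not read from any real config


class ToyMDSMonitor:
    """Toy model of the leader's beacon bookkeeping (hypothetical, not Ceph code)."""

    def __init__(self):
        self.readable = True
        self.last_beacon = {}   # gid -> time of the last *processed* beacon
        self.pending = deque()  # (gid, recv_time) beacons held while not readable
        self.failed = []        # gids that tick() decided to fail

    def handle_beacon(self, gid, now):
        if not self.readable:
            # Step 2 of the timeline: hold the beacon until readable.
            self.pending.append((gid, now))
            return
        self.last_beacon[gid] = now

    def set_readable(self):
        self.readable = True
        # Replay held beacons, crediting each with the time it arrived.
        while self.pending:
            gid, recv_time = self.pending.popleft()
            prev = self.last_beacon.get(gid, recv_time)
            self.last_beacon[gid] = max(prev, recv_time)

    def tick(self, now, track_pending=True):
        waiting = {gid for gid, _ in self.pending}
        for gid, seen in list(self.last_beacon.items()):
            if track_pending and gid in waiting:
                # A beacon is queued for this MDS; don't judge it yet.
                continue
            if now - seen > BEACON_GRACE:
                # Step 3 of the timeline: no processed beacon within grace.
                self.failed.append(gid)
                del self.last_beacon[gid]


def run(track_pending):
    mon = ToyMDSMonitor()
    mon.handle_beacon(4141, now=0.0)   # last beacon processed normally
    mon.readable = False               # step 1: proposal pending
    for t in (4.0, 8.0, 12.0, 16.0):   # beacons keep arriving, but are held
        mon.handle_beacon(4141, now=t)
    mon.tick(now=16.5, track_pending=track_pending)
    mon.set_readable()
    return mon


if __name__ == "__main__":
    print("without tracking:", run(False).failed)
    print("with tracking:   ", run(True).failed)
```

Note this only covers beacons the leader has already received. A
beacon still sitting on a slow peon is invisible to it, so I wouldn't
claim this closes the race either.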

(FYI: attachments won't go through vger at all; I imagine you got a
bounce back from it. Use pastebin or whatever.)
-Greg