John, I am noting down my findings on http://tracker.ceph.com/issues/19706 when reading jspray-2017-05-23_11:58:06-fs-wip-jcsp-testing-20170523-distro-basic-smithi/1221142.

On Wed, May 24, 2017 at 1:12 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Tue, May 23, 2017 at 10:31 AM John Spray <jspray@xxxxxxxxxx> wrote:
>>
>> Hi all,
>>
>> I could use some help from people who understand the mon better than I
>> do with this ticket: http://tracker.ceph.com/issues/19706
>>
>> The issue is that MDSMonitor is incorrectly killing MDSs because it
>> hasn't seen beacon messages, but the beacon messages are actually just
>> held up because is_readable = 0, like this:
>>
>> 2017-05-23 13:34:20.054785 7f772f1c2700 10
>> mon.b@0(leader).paxosservice(mdsmap 1..11) dispatch 0x7f7742989740
>> mdsbeacon(4141/a up:active seq 96 v9) v7 from mds.0
>> 172.21.15.77:6809/2700711429 con 0x7f77428d8f00
>> 2017-05-23 13:34:20.054788 7f772f1c2700  5 mon.b@0(leader).paxos(paxos
>> recovering c 1..293) is_readable = 0 - now=2017-05-23 13:34:20.054789
>> lease_expire=0.000000 has v0 lc 293
>> 2017-05-23 13:34:20.054791 7f772f1c2700 10
>> mon.b@0(leader).paxosservice(mdsmap 1..11) waiting for paxos ->
>> readable (v9)
>>
>> This appears to be happening when one or more mons are a bit laggy,
>> but it is happening before an election has happened.
>>
>> We have code for handling slow elections by checking how long it has
>> been since the last tick, and resetting our timeout information for
>> MDS beacons if it has been too long
>> (https://github.com/ceph/ceph/blob/master/src/mon/MDSMonitor.cc#L2070)
>>
>> However, in this case the tick() function is getting called
>> throughout; we're just not seeing the beacons because they're held up
>> waiting for readable.
>
> This story doesn't seem quite right/complete. This code is only invoked
> if the monitor is the leader, which certainly doesn't happen when the
> monitor is out of quorum.
> Is the beacon maybe going to a peon which
> isn't forwarding quickly enough?

No, this code is also invoked when the mon is a peon, see:

```
bool PaxosService::dispatch(MonOpRequestRef op)
...
  // make sure our map is readable and up to date
  if (!is_readable(m->version)) {
    dout(10) << " waiting for paxos -> readable (v" << m->version << ")"
             << dendl;
    wait_for_readable(op, new C_RetryMessage(this, op), m->version);
    return true;
  }
```

As John pointed out, the beacon is being held up by a peon which is waiting for a readable store. I just updated http://tracker.ceph.com/issues/19706#note-7 with my findings: the peon took more than 5 seconds to apply the transaction, which just does not make sense.

>>
>> I could hack around this by only doing timeouts if *any* daemon has
>> successfully got a beacon through in the last (mds_beacon_grace*2) or
>> something like that, but I wonder if there's a Right way to handle
>> this for PaxosService subclasses?
>
> The MDS is I think the only thing doing this, so if its patterns
> aren't working we probably don't. (The manager just started doing it,
> so may have a similar problem?)
> I suppose you might take into account the paxos quorum timeouts and
> which monitor the MDS was connected to, so that it only marks an MDS
> down if you have positive belief the monitor it was pinging is alive?
> -Greg
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

-- 
Regards
Kefu Chai
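For what it's worth, John's proposed workaround could look something like the sketch below. This is not the actual MDSMonitor code; all names here (BeaconTracker, note_beacon, should_kill) are hypothetical, and it only illustrates the idea: an MDS is killed for a missed beacon only if beacons in general have been flowing, i.e. *some* daemon got one through within mds_beacon_grace*2, so a stalled paxos that holds up *all* beacons does not look like every MDS dying at once.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of the "only time out if any beacon got through
// recently" heuristic -- not Ceph's actual MDSMonitor implementation.
struct BeaconTracker {
  double beacon_grace = 15.0;                // stands in for mds_beacon_grace
  std::map<std::string, double> last_beacon; // mds name -> last beacon time
  double last_any_beacon = 0.0;              // last time any beacon arrived

  void note_beacon(const std::string& who, double now) {
    last_beacon[who] = now;
    last_any_beacon = now;
  }

  // Kill an MDS only if its own beacon is overdue *and* beacons in
  // general are still getting through (some daemon beaconed within
  // beacon_grace * 2); if paxos is stalling all beacons, do nothing.
  bool should_kill(const std::string& who, double now) const {
    auto it = last_beacon.find(who);
    if (it == last_beacon.end())
      return false;
    bool overdue = now - it->second > beacon_grace;
    bool beacons_flowing = now - last_any_beacon <= beacon_grace * 2;
    return overdue && beacons_flowing;
  }
};
```

The point of the second condition is that when the mon is waiting on a readable store, no beacon from any MDS is processed, so beacons_flowing goes false for everyone and nothing gets marked down spuriously.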