Re: cosd multi-second stalls cause "wrongly marked me down"

Sage Weil <sage@xxxxxxxxxxxx> · Wed, 23 Feb 2011 10:54:42 -0800 (PST)

On Wed, 23 Feb 2011, Gregory Farnum wrote:
> 
> On Wednesday, February 23, 2011 at 9:52 AM, Jim Schutt wrote: 
> > 
> > On Thu, 2011-02-17 at 09:11 -0700, Sage Weil wrote:
> > > On Thu, 17 Feb 2011, Jim Schutt wrote:
> > > > Hi Sage,
> > > > 
> > > > On Wed, 2011-02-16 at 17:54 -0700, Sage Weil wrote:
> > > > > On Wed, 16 Feb 2011, Sage Weil wrote:
> > > > > > shouldn't affect anything. We may have missed something.. do you have a 
> > > > > > log showing this in action?
> > > > > 
> > > > > Obviously yes, looking at your original email. :) At the beginning of 
> > > > > each log line we include a thread id. What would be really helpful would 
> > > > > be to narrow down where in OSD::heartbeat_entry() and heartbeat() things 
> > > > > are blocking, either based on the existing output, or by adding additional 
> > > > > dout lines at interesting points in time.
> > > > 
> > > > I'll take a deeper look at my existing logs with
> > > > that in mind; let me know if you'd like me to
> > > > send you some.
> > > > 
> > > > I have also been looking at map_lock, as it seems
> > > > to be shared between the heartbeat and map update
> > > > threads.
> > > > 
> > > > Would instrumenting acquiring/releasing that lock
> > > > be helpful? Is there some other lock that may
> > > > be more fruitful to instrument? I can reproduce 
> > > > pretty reliably, so adding instrumentation is 
> > > > no problem.
> > > 
> > > The heartbeat thread is doing a map_lock.try_get_read() because it 
> > > frequently is held by another thread, so that shouldn't ever block. 
> > > 
> > > The possibilities I see are:
> > >  - peer_stat_lock
> > >  - the monc->sub_want / renew_subs calls (monc has an internal lock), 
> > > although that code should only trigger with a single osd. :/
> > >  - heartbeat_lock itself could be held by another thread; i'd instrument 
> > > all locks/unlocks there, along with the wakeup in heartbeat().
> > 
> > I think the culprit is osd_lock.
> > 
> > The OSD tick timer needs to acquire the osd_lock mutex to
> > wake up, right? So if osd_lock is ever held for an excessive 
> > time, the tick will be delayed.
> > 
> > I used debug lockdep = 20 on my osds, and saw this via
> > egrep "10:07:(2[2-9]|3[0-3])" osd.70.log | egrep "tick|osd_lock" | grep -v "tick .*"
> > 
> > 2011-02-23 10:07:23.454954 0x7f51a17fa940 lockdep: _will_lock OSD::osd_lock (9)
> > 2011-02-23 10:07:23.455118 0x7f51a37fe940 lockdep: _will_unlock OSD::osd_lock
> > 2011-02-23 10:07:23.455254 0x7f51a37fe940 lockdep: _will_lock OSD::osd_lock (9)
> > 2011-02-23 10:07:23.455262 0x7f51a37fe940 lockdep: _locked OSD::osd_lock
> > 2011-02-23 10:07:33.008092 0x7f51a37fe940 lockdep: _will_unlock OSD::osd_lock
> > 2011-02-23 10:07:33.009135 0x7f51a17fa940 lockdep: _locked OSD::osd_lock
> > 2011-02-23 10:07:33.009217 0x7f51a17fa940 lockdep: _will_unlock OSD::osd_lock
> > 2011-02-23 10:07:33.009307 0x7f51ad89d940 osd70 149 tick
> > 
> > I've attached a full log of the 10 second period that osd_lock
> > was held.
> > 
> > Is it possible to separate out the parts of tick() that
> > need to covered by osd_lock and do them somewhere else, 
> > so that a new lock, say tick_lock, can be used to
> > protect the tick timer?
> > 
> > -- Jim
> > 
> 
> tick() isn't involved in heartbeats, actually, so any blocks there 
> should be unrelated to heartbeat bugs (and once you're in tick() or 
> functions it calls, there's no losing the heartbeat!).

Right.  I'm not sure tick is related to the problem?

> I have managed to get OSDs wrongly marking each other down during 
> startup when they're peering large numbers of PGs/pools, as they 
> disagree on who they need to be heartbeating (due to the slow handling 
> of new osd maps and pg creates); if you're mostly seeing OSDs get 
> incorrectly marked down during low epochs (your original email said 
> epoch 7) this is probably what you're finding.

FWIW, this isn't supposed to happen either.. the implementation may be 
broken somewhat.  The idea is that once an OSD starts to expect a 
heartbeat it should tell them so.  And if an OSD is told that a future 
epoch says it should send heartbeats to node foo, then it will do so, at 
least until it processes that epoch.

If you have logs reproducing this, let's take a closer look so we can fix 
it.  The implementation has probably drifted from the original 
design/intent... :)

> We still have no idea what could be causing the stall *inside* of 
> tick(), though. :/

You mean heartbeat(), right?  Yep, still no clue...  :(

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html