Re: cosd multi-second stalls cause "wrongly marked me down"

Sage Weil <sage@xxxxxxxxxxxx> · Thu, 31 Mar 2011 10:24:05 -0700 (PDT)

On Thu, 31 Mar 2011, Jim Schutt wrote:
> Jim Schutt wrote:
> > Sage Weil wrote:
> > > On Thu, 31 Mar 2011, Jim Schutt wrote:
> > > > > I was actually suggesting we try to make it core dump inside the
> > > > > "delete this" and watching for a stall in progress and then
> > > > > sending SIGABRT to dump core in the act.  That way we verify it
> > > > > really is in the allocator (and maybe even see where).  That's a
> > > > > bit harder to set up, though!
> > > > Right, I couldn't think of how to automate that stall detection
> > > > during the stall, rather than after.  At least, I couldn't
> > > > think of how to do it without incurring possibly excessive
> > > > overhead, say by starting a timer on every "delete this".
> > > 
> > > Yeah.  I wonder if dumping core on a cosd right when it gets marked down
> > > would do the trick?  That should catch it ~20 seconds or whatever in the
> > > stall.  By watching for the "osdfoo marked down" messages from ceph -w?
> > 
> > What about making Cond::Wait() use pthread_cond_timedwait()
> > with a suitable timeout value, say 10 seconds, and asserting
> > on timeout?  Do you think there would be many legitimate 10
> > second delays in OSD processing?
> > 
> 
> Or, I could make a Cond::WaitIntervalOrAbort(), and
> use it just on the pipe lock, since that's the source
> of the trouble.  Sound useful?

Yeah, that sounds like the way to go.  Then you can hand-pick the site(s)
that are waiting a long time in this case and switch those to
WaitIntervalOrAbort?  Hopefully the cond timer will go off despite
whatever badness is going on in "delete this"...

sage