Re: cosd multi-second stalls cause "wrongly marked me down"

Jim Schutt <jaschut@xxxxxxxxxx> · Thu, 31 Mar 2011 11:10:11 -0600

Jim Schutt wrote:
Sage Weil wrote:
On Thu, 31 Mar 2011, Jim Schutt wrote:
I was actually suggesting we try to make it core dump inside the
"delete this" by watching for a stall in progress and then sending
SIGABRT to dump core in the act.  That way we verify it really is in the
allocator (and maybe even see where).  That's a bit harder to set up,
though!

Right, I couldn't think of how to automate that stall detection
during the stall, rather than after.  At least, I couldn't
think of how to do it without incurring possibly excessive
overhead, say by starting a timer on every "delete this".

Yeah.  I wonder if dumping core on a cosd right when it gets marked 
down would do the trick?  That should catch it ~20 seconds or whatever 
in the stall.  By watching for the "osdfoo marked down" messages from 
ceph -w?

What about making Cond::Wait() use pthread_cond_timedwait()
with a suitable timeout value, say 10 seconds, and asserting
on timeout?  Do you think there would be many legitimate 10
second delays in OSD processing?

Or, I could make a Cond::WaitIntervalOrAbort(), and
use it just on the pipe lock, since that's the source
of the trouble.  Sound useful?
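
A minimal sketch of what such a wait-or-abort helper might look like,
built directly on pthread_cond_timedwait().  The class name TimedCond,
the WaitInterval() helper, and the millisecond timeout parameter are
illustrative assumptions, not the existing Cond interface:

```cpp
#include <pthread.h>
#include <cerrno>
#include <cstdlib>
#include <ctime>

// Hypothetical stand-in for a Cond-like wrapper (not the real Cond class).
class TimedCond {
  pthread_cond_t cond;

public:
  TimedCond() { pthread_cond_init(&cond, nullptr); }
  ~TimedCond() { pthread_cond_destroy(&cond); }

  // Wait for a signal for at most timeout_ms milliseconds.
  // The caller must hold `mutex`.  Returns false on timeout, true otherwise.
  // As with any condition wait, a true return can be a spurious wakeup,
  // so callers should still re-check their predicate.
  bool WaitInterval(pthread_mutex_t &mutex, long timeout_ms) {
    // pthread_cond_timedwait() takes an absolute CLOCK_REALTIME deadline
    // (the default clock for a condition variable).
    timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
      deadline.tv_sec += 1;
      deadline.tv_nsec -= 1000000000L;
    }
    int r = pthread_cond_timedwait(&cond, &mutex, &deadline);
    return r != ETIMEDOUT;
  }

  // Same wait, but dump core via abort() if the timeout expires.
  void WaitIntervalOrAbort(pthread_mutex_t &mutex, long timeout_ms) {
    if (!WaitInterval(mutex, timeout_ms))
      abort();  // raises SIGABRT, producing a core dump if enabled
  }

  void Signal() { pthread_cond_signal(&cond); }
};
```

One caveat with this approach: pthread_cond_timedwait() reacquires the
mutex before it returns, even on timeout, so if the stalled thread is
the one holding the lock, the abort fires only once the lock becomes
available again.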

-- Jim