Re: cosd multi-second stalls cause "wrongly marked me down"

Jim Schutt wrote:
Sage Weil wrote:
On Thu, 31 Mar 2011, Jim Schutt wrote:
I was actually suggesting we try to make it dump core inside the
"delete this", by watching for a stall in progress and then sending
SIGABRT to dump core in the act.  That way we verify it really is in
the allocator (and maybe even see where). That's a bit harder to set
up, though!
Right, I couldn't think of how to automate that stall detection
during the stall, rather than after.  At least, I couldn't
think of how to do it without incurring possibly excessive
overhead, say by starting a timer on every "delete this".

Yeah. I wonder if dumping core on a cosd right when it gets marked down would do the trick? That should catch it ~20 seconds or whatever into the stall. We could trigger it by watching for the "osdfoo marked down" messages from ceph -w?

What about making Cond::Wait() use pthread_cond_timedwait()
with a suitable timeout value, say 10 seconds, and asserting
on timeout?  Do you think there would be many legitimate 10
second delays in OSD processing?


Or, I could make a Cond::WaitIntervalOrAbort(), and
use it just on the pipe lock, since that's the source
of the trouble.  Sound useful?
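
Roughly, something like the sketch below. The Mutex and Cond
here are simplified stand-ins, not the actual Ceph classes, and
WaitInterval() is a hypothetical helper added only so the timeout
path can be exercised without aborting. One caveat:
pthread_cond_timedwait() re-acquires the mutex before it returns,
even on timeout, so if the stalled thread is holding the lock the
abort would only fire once the lock is released.

```cpp
#include <pthread.h>
#include <time.h>
#include <errno.h>
#include <cassert>
#include <cstdlib>

// Simplified stand-in for a mutex wrapper (assumption: not Ceph's real class).
struct Mutex {
  pthread_mutex_t m;
  Mutex() { pthread_mutex_init(&m, nullptr); }
  ~Mutex() { pthread_mutex_destroy(&m); }
  void Lock() { pthread_mutex_lock(&m); }
  void Unlock() { pthread_mutex_unlock(&m); }
};

// Simplified stand-in for a condition variable wrapper.
struct Cond {
  pthread_cond_t c;
  Cond() { pthread_cond_init(&c, nullptr); }
  ~Cond() { pthread_cond_destroy(&c); }

  // Wait up to 'seconds' for a signal; caller must hold 'mutex'.
  // Returns 0 if woken (possibly spuriously), ETIMEDOUT on timeout.
  int WaitInterval(Mutex &mutex, int seconds) {
    struct timespec ts;
    // A default-initialized condition variable measures its absolute
    // deadline against CLOCK_REALTIME.
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += seconds;
    return pthread_cond_timedwait(&c, &mutex.m, &ts);
  }

  // Same wait, but abort (dumping core, if enabled) when it times out.
  void WaitIntervalOrAbort(Mutex &mutex, int seconds) {
    if (WaitInterval(mutex, seconds) == ETIMEDOUT)
      abort();
  }

  void Signal() { pthread_cond_signal(&c); }
};
```

A caller would hold the lock and call
cond.WaitIntervalOrAbort(lock, 10) in place of cond.Wait(lock).
If the caller loops on a predicate, each pass through the loop
restarts the interval, so the timeout bounds a single wait rather
than the total time spent waiting.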

-- Jim