On Thu, 31 Mar 2011, Jim Schutt wrote:
> Sage Weil wrote:
> > On Thu, 31 Mar 2011, Jim Schutt wrote:
> > > Jim Schutt wrote:
> > > > Sage Weil wrote:
> > > > > On Thu, 31 Mar 2011, Jim Schutt wrote:
> > > > > > > I was actually suggesting we try to make it core dump inside the
> > > > > > > "delete this" and watching for a stall in progress and then
> > > > > > > sending SIGABRT to dump core in the act. That way we verify it
> > > > > > > really is in the allocator (and maybe even see where). That's a
> > > > > > > bit harder to set up, though!
> > > > > > Right, I couldn't think of how to automate that stall detection
> > > > > > during the stall, rather than after. At least, I couldn't
> > > > > > think of how to do it without incurring possibly excessive
> > > > > > overhead, say by starting a timer on every "delete this".
> > > > > Yeah. I wonder if dumping core on a cosd right when it gets marked
> > > > > down would do the trick? That should catch it ~20 seconds or
> > > > > whatever in the stall. By watching for the "osdfoo marked down"
> > > > > messages from ceph -w?
> > > > What about making Cond::Wait() use pthread_cond_timedwait()
> > > > with a suitable timeout value, say 10 seconds, and asserting
> > > > on timeout? Do you think there would be many legitimate 10
> > > > second delays in OSD processing?
> > > >
> > >
> > > Or, I could make a Cond::WaitIntervalOrAbort(), and
> > > use it just on the pipe lock, since that's the source
> > > of the trouble. Sound useful?
> > >
> > Yeah that sounds like the way to go.. then you can hand pick the site(s)
> > that is/are waiting a long time in this case and switch those to
> > WaitIntervalOrAbort? Hopefully the cond timer will go off despite whatever
> > badness is going on in delete this...
> >
> Actually, it occurs to me Wait() isn't what I'm after:
> that is used to wait some unknown time for some event.
>
> I think instead I need to use TryLock() on the pipe_lock
> in submit_message(), in a loop with a suitable sleep,
> say 100us, and assert when it takes too long to acquire
> the lock.
>
> So, maybe add a Mutex::LockOrAbort(), and use it in
> submit_message()?
>
> submit_message() is intended to return immediately, no?
> And the issue is caused by heartbeat() being unable to
> queue messages, so this sounds to me to be a useful
> test.
>
> Does that seem to have low enough overhead to
> be useful?

Yeah, that sounds right!

sage