On Thu, 31 Mar 2011, Jim Schutt wrote:
> Jim Schutt wrote:
> > Sage Weil wrote:
> > > On Thu, 31 Mar 2011, Jim Schutt wrote:
> > > > > I was actually suggesting we try to make it core dump inside the
> > > > > "delete this" and watching for a stall in progress and then
> > > > > sending SIGABRT to dump core in the act. That way we verify it
> > > > > really is in the allocator (and maybe even see where). That's a
> > > > > bit harder to set up, though!
> > > > Right, I couldn't think of how to automate that stall detection
> > > > during the stall, rather than after. At least, I couldn't
> > > > think of how to do it without incurring possibly excessive
> > > > overhead, say by starting a timer on every "delete this".
> > >
> > > Yeah. I wonder if dumping core on a cosd right when it gets marked down
> > > would do the trick? That should catch it ~20 seconds or whatever in the
> > > stall. By watching for the "osdfoo marked down" messages from ceph -w?
> >
> > What about making Cond::Wait() use pthread_cond_timedwait()
> > with a suitable timeout value, say 10 seconds, and asserting
> > on timeout? Do you think there would be many legitimate 10
> > second delays in OSD processing?
> >
>
> Or, I could make a Cond::WaitIntervalOrAbort(), and
> use it just on the pipe lock, since that's the source
> of the trouble. Sound useful?

Yeah that sounds like the way to go.. then you can hand pick the site(s)
that is/are waiting a long time in this case and switch those to
WaitIntervalOrAbort? Hopefully the cond timer will go off despite
whatever badness is going on in delete this...

sage
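For reference, here is a minimal sketch of what a WaitIntervalOrAbort() along the lines discussed above might look like. It is not the actual Ceph Cond class: the TimedCond name, the WaitInterval() helper, and the seconds parameter are all made up for illustration, and it assumes a condition variable with default attributes (so the deadline is measured against CLOCK_REALTIME).

```cpp
#include <pthread.h>
#include <time.h>
#include <errno.h>
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for a Cond wrapper; not Ceph's real class.
class TimedCond {
  pthread_cond_t cond;

public:
  TimedCond() { pthread_cond_init(&cond, nullptr); }
  ~TimedCond() { pthread_cond_destroy(&cond); }
  TimedCond(const TimedCond&) = delete;
  TimedCond& operator=(const TimedCond&) = delete;

  // Caller must hold `mutex`. Returns 0 if woken (possibly spuriously),
  // or ETIMEDOUT if `seconds` elapsed first.
  int WaitInterval(pthread_mutex_t& mutex, int seconds) {
    struct timespec deadline;
    // pthread_cond_timedwait() takes an absolute deadline, not an interval.
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += seconds;
    return pthread_cond_timedwait(&cond, &mutex, &deadline);
  }

  // Same as WaitInterval(), but a timeout is treated as a stall:
  // abort() raises SIGABRT, which dumps core if core dumps are enabled.
  void WaitIntervalOrAbort(pthread_mutex_t& mutex, int seconds) {
    int r = WaitInterval(mutex, seconds);
    if (r == ETIMEDOUT) {
      fprintf(stderr, "cond wait exceeded %d seconds, aborting\n", seconds);
      abort();
    }
    assert(r == 0);  // any other error (e.g. EINVAL) is a programming bug
  }

  void Signal() { pthread_cond_signal(&cond); }
};
```

A hand-picked wait site would swap its plain wait for WaitIntervalOrAbort(lock, 10), still inside the usual loop that rechecks the predicate. Note that each call starts a fresh deadline, so the timeout applies per wakeup, not to the total time spent in that loop.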