Re: cosd multi-second stalls cause "wrongly marked me down"

Sage Weil <sage@xxxxxxxxxxxx> · Thu, 31 Mar 2011 09:25:05 -0700 (PDT)

On Thu, 31 Mar 2011, Jim Schutt wrote:
> > I was actually suggesting we try to make it core dump inside the "delete
> > this" and watching for a stall in progress and then sending SIGABRT to dump
> > core in the act.  That way we verify it really is in the allocator (and
> > maybe even see where).  That's a bit harder to set up, though!  
> 
> Right, I couldn't think of how to automate that stall detection
> during the stall, rather than after.  At least, I couldn't
> think of how to do it without incurring possibly excessive
> overhead, say by starting a timer on every "delete this".

Yeah.  I wonder if dumping core on a cosd right when it gets marked down 
would do the trick?  That should catch it roughly 20 seconds or so into the 
stall.  You could trigger it by watching for the "osdfoo marked down" 
messages from ceph -w.
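
A minimal sketch of that watcher, not a tested tool.  It assumes the 
"marked down" lines look like the "osdfoo marked down" wording above (the 
regex is a guess at the real ceph -w output), and that you supply a 
hypothetical osd id -> pid map for the cosd processes on the local host. 
Note that SIGABRT only leaves a core file if the daemon's core size limit 
allows it.

```python
import os
import re
import signal

# Assumed message shape, taken only from the "osdfoo marked down" wording;
# adjust the pattern to whatever `ceph -w` really prints.
MARKED_DOWN = re.compile(r"\bosd\.?(\d+)\b.*\bmarked down\b")


def marked_down_osd(line):
    """Return the OSD id named in a 'marked down' line, or None."""
    m = MARKED_DOWN.search(line)
    return int(m.group(1)) if m else None


def abort_osd(osd_id, pids, kill=os.kill):
    """Send SIGABRT to the local cosd for osd_id.

    pids is a hypothetical {osd id: pid} map supplied by the caller.
    Returns False if the OSD is not running on this host.
    """
    pid = pids.get(osd_id)
    if pid is None:
        return False
    kill(pid, signal.SIGABRT)
    return True


def watch(stream, pids, kill=os.kill):
    """Scan lines from `ceph -w`; abort each local OSD that gets marked down."""
    aborted = []
    for line in stream:
        osd_id = marked_down_osd(line)
        if osd_id is not None and abort_osd(osd_id, pids, kill):
            aborted.append(osd_id)
    return aborted
```

To use it, pipe ceph -w into a small script that calls 
watch(sys.stdin, {3: 12345}), with the real osd ids and pids filled in.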

> > Dumping right after may still yield some useful info, but I'm less
> > hopeful...
> 
> I thought I might try turning off all debugging, except a notice
> that the "delete this" took too long.  This is easy to do, and
> would tell us if allocator activity in support of debugging is
> affecting operations.  It doesn't lead to any ideas for
> improving the situation, though :/
> 
> Also, since I built tcmalloc from source, I thought I might
> try to figure out what operation is taking too long there.
> I'm hoping Ceph logging redirection is set up so that stdout
> or stderr from tcmalloc would show up in my log files?

Not with the default logging stuff.  However, you can run the daemons with 
'-d' and they will stay in the foreground and log to stderr.  Or -f will 
send the ceph logs to their usual locations, but the daemon won't fork and 
you can redirect stdout/stderr (with any tcmalloc stuff) wherever you 
like.
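
A small sketch of the -f variant, for illustration only.  The helper name 
run_foreground, the osd id, the config path, and the output file are all 
placeholders; only the -f flag comes from the text above.

```python
import subprocess


def run_foreground(argv, out_path):
    """Run argv without daemonizing, appending stdout and stderr to out_path.

    Anything the process (or a library linked into it, such as tcmalloc)
    writes to stdout or stderr ends up in out_path.  Returns the exit status.
    """
    with open(out_path, "ab") as out:
        return subprocess.call(argv, stdout=out, stderr=subprocess.STDOUT)


# Hypothetical invocation; the id, config path, and output file are placeholders:
# run_foreground(["cosd", "-i", "0", "-c", "/etc/ceph/ceph.conf", "-f"],
#                "/tmp/osd0.stderr")
```

The call blocks until the daemon exits, so it belongs in its own terminal 
or wrapper script.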

sage