Re: cosd multi-second stalls cause "wrongly marked me down"

Sage Weil <sage@xxxxxxxxxxxx> · Thu, 3 Mar 2011 10:51:52 -0800 (PST)

On Thu, 3 Mar 2011, Jim Schutt wrote:
> On Thu, 2011-03-03 at 11:04 -0700, Sage Weil wrote:
> > On Thu, 3 Mar 2011, Jim Schutt wrote:
> > > 
> > > On Wed, 2011-03-02 at 22:03 -0700, Sage Weil wrote:
> > > > > I'm not sure how to track down what's happening here...
> > > > 
> > > > Hmm.  I'm not able to reproduce this here (tho I only have ~15 nodes 
> > > > available at the moment).  Seeing the last bit of the logs on the crashed 
> > > > nodes will help.
> > > > 
> > 
> > Can you confirm that the chdir is working now?  Maybe put an assert(0) in 
> > tick() so we can verify core dumps are working in general?
> 
> Great idea, and chdir is definitely working; got 96 core 
> files as expected.

Can you put an assert(0) at the top of OSD::shutdown() so we can verify 
that the OSD isn't trying to shut itself down cleanly?  (There are a few 
cases where it might do that.)  The logs you had make it look a bit like 
that could be the case, or like it is crashing in an unpleasant way in 
the messenger pipe teardown.
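
Roughly what I have in mind, as a standalone sketch rather than a patch against the tree.  FakeOSD and signal_from_shutdown() are made-up names for illustration (the real change is just the one assert line at the top of OSD::shutdown()), and the fork/waitpid wrapper is only there to show that the process dies with SIGABRT instead of exiting cleanly:

```cpp
// Keep assert() active even if the build defines NDEBUG.
#undef NDEBUG
#include <cassert>
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for the real OSD class; only the shape of
// shutdown() matters here.
struct FakeOSD {
  bool stopped = false;

  int shutdown() {
    // Temporary debugging aid: abort immediately so that a core file
    // (if core dumps are enabled) records who called shutdown().
    assert(0);
    stopped = true;  // normal teardown would continue from here
    return 0;
  }
};

// Runs shutdown() in a forked child and reports how the child ended:
// the terminating signal number, 0 for a normal exit, -1 on error.
int signal_from_shutdown() {
  pid_t pid = fork();
  if (pid < 0)
    return -1;
  if (pid == 0) {
    FakeOSD osd;
    osd.shutdown();
    _exit(0);  // only reached if the assert did not fire
  }
  int status = 0;
  if (waitpid(pid, &status, 0) < 0)
    return -1;
  return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

If signal_from_shutdown() comes back as SIGABRT, the assert fired.  In the real daemon the equivalent evidence would be a core file whose backtrace goes through OSD::shutdown(); no such core would point back at the messenger teardown instead.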

Thanks!
sage