Re: cosd multi-second stalls cause "wrongly marked me down"

"Jim Schutt" <jaschut@xxxxxxxxxx> · Thu, 3 Mar 2011 12:39:44 -0700

On Thu, 2011-03-03 at 11:51 -0700, Sage Weil wrote:
> On Thu, 3 Mar 2011, Jim Schutt wrote:
> > On Thu, 2011-03-03 at 11:04 -0700, Sage Weil wrote:
> > > On Thu, 3 Mar 2011, Jim Schutt wrote:
> > > > 
> > > > On Wed, 2011-03-02 at 22:03 -0700, Sage Weil wrote:
> > > > > > I'm not sure how to track down what's happening here...
> > > > > 
> > > > > Hmm.  I'm not able to reproduce this here (tho I only have ~15 nodes 
> > > > > available at the moment).  Seeing the last bit of the logs on the crashed 
> > > > > nodes will help.
> > > > > 
> > > 
> > > Can you confirm that the chdir is working now?  Maybe put an assert(0) in 
> > > tick() so we can verify core dumps are working in general?
> > 
> > Great idea, and chdir is definitely working; got 96 core 
> > files as expected.
> 
> Can you put an assert(0) at the top of OSD::shutdown() so we can verify 
> that the OSD isn't trying to shut itself down cleanly?  (There are a few 
> cases where it might do that.)  The logs you had make it look a bit like 
> that could be the case.  Or that it is crashing in an unpleasant way in 
> the messenger pipe teardown.

No luck there.  Dead OSDs, but no core files.
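
One way to tell a clean self-shutdown apart from a crash when no core file shows up is to look at the wait status the parent process receives. The following is only a minimal sketch of that idea, not anything from the Ceph tree: `describe_child_exit`, `clean_exit`, and `failed_assert` are hypothetical names made up for illustration, and it assumes a POSIX system where `fork`, `waitpid`, and (on most platforms) `WCOREDUMP` are available.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cassert>
#include <csignal>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of Ceph): run child_body in a forked
// child and describe how that child ended, based on its wait status.
std::string describe_child_exit(void (*child_body)()) {
    pid_t pid = fork();
    if (pid < 0) return "fork failed";
    if (pid == 0) {
        child_body();
        _exit(0);  // reached only if child_body returns normally
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return "waitpid failed";

    if (WIFEXITED(status)) {
        // The process called exit()/_exit() itself: a "clean" shutdown.
        return "exited with status " + std::to_string(WEXITSTATUS(status));
    }
    if (WIFSIGNALED(status)) {
        // The process was terminated by a signal, e.g. SIGABRT or SIGSEGV.
        std::string msg = "killed by signal " + std::to_string(WTERMSIG(status));
        if (WTERMSIG(status) == SIGABRT) msg += " [SIGABRT]";
#ifdef WCOREDUMP
        // WCOREDUMP is not in POSIX proper, hence the guard.
        if (WCOREDUMP(status)) msg += " (core dumped)";
#endif
        return msg;
    }
    return "unknown wait status";
}

// Stand-in for a daemon that shuts itself down cleanly.
void clean_exit() { _exit(0); }

// Stand-in for a daemon that hits assert(0). The explicit abort() covers
// builds with NDEBUG defined, where assert() compiles to nothing.
void failed_assert() {
    assert(0);
    std::abort();
}
```

Calling `describe_child_exit(clean_exit)` reports a normal exit, while `describe_child_exit(failed_assert)` reports death by SIGABRT; whether "(core dumped)" is appended depends on the core size limit and core pattern in effect on the machine.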

FWIW, I've got a patch for init-ceph that lets
me run every daemon instance under valgrind and
log its output to a separate file.  I could try 
that if you think it might be useful.

Things run pretty slowly that way, so if there's
other testing you'd like me to try I should do
it first.

-- Jim

> 
> Thanks!
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
