On Thu, 2011-03-03 at 11:51 -0700, Sage Weil wrote:
> On Thu, 3 Mar 2011, Jim Schutt wrote:
> > On Thu, 2011-03-03 at 11:04 -0700, Sage Weil wrote:
> > > On Thu, 3 Mar 2011, Jim Schutt wrote:
> > > >
> > > > On Wed, 2011-03-02 at 22:03 -0700, Sage Weil wrote:
> > > > > > I'm not sure how to track down what's happening here...
> > > > >
> > > > > Hmm. I'm not able to reproduce this here (tho I only have ~15 nodes
> > > > > available at the moment). Seeing the last bit of the logs on the crashed
> > > > > nodes will help.
> > > > >
> > >
> > > Can you confirm that the chdir is working now? Maybe put an assert(0) in
> > > tick() so we can verify core dumps are working in general?
> >
> > Great idea, and chdir is definitely working; got 96 core
> > files as expected.
>
> Can you put an assert(0) at the top of OSD::shutdown() so we can verify
> that the OSD isn't trying to shut itself down cleanly? (There are a few
> cases where it might do that.) The logs you had make it look a bit like
> that could be the case. Or that it is crashing in an unpleasant way in
> the messenger pipe teardown.

No luck there. Dead OSDs, but no core files.

FWIW, I've got a patch for init-ceph that lets me run every
daemon instance under valgrind and log its output to a separate
file. I could try that if you think it might be useful.

Things run pretty slowly that way, so if there's other testing
you'd like me to try I should do it first.

-- Jim

>
> Thanks!
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>