On Wed, 2 Mar 2011, Jim Schutt wrote:
> On Tue, 2011-03-01 at 17:53 -0700, Sage Weil wrote:
> > Hi Jim,
> >
> > We've fixed a few different bugs over the last week that were causing
> > heartbeat issues.
>
> Great!
>
> > Nothing that explains why we would see the hang that
> > you did, but other problems that caused the same 'wrongly marked me down'
> > issue.  Are you still seeing this problem with the latest 'next' and/or
> > 'master' branch?
>
> I've been trying to isolate this on the stable branch
> since my last posting - I can still reproduce at will
> with my 96 osd test, but I haven't made much progress
> at tracking down what is going wrong.
>
> >
> > Also, if you don't mind reproducing, can you post a larger segment of the
> > log?
>
> Sure.  I've got some extra debug printing going in
> my tree - the most interesting is a patch to log
> queue, operation, and total elapsed times in
> dispatch_entry() - it makes it really easy to
> find when things go wrong.
>
> I'll try to reproduce with master and post logs.
> Is it OK for me to add my extra debug patches for
> that?  I'll post them with the logs if so.

Absolutely.

> > The really interesting question is what the heartbeat thread
> > (heartbeat_entry()) is doing during this period that tick() is blocked up,
> > since that's the thread that's responsible for sending the ping messages
> > to peer OSDs.
>
> One of the things I am seeing is handle_osd_ping()
> getting stalled, but I haven't been able to track
> down why.
>
> I'll see if I see the same signature with master,
> and post logs.

Thanks!  Keep us posted.

sage