On Tue, 2011-03-01 at 17:53 -0700, Sage Weil wrote:
> Hi Jim,
>
> We've fixed a few different bugs over the last week that were causing
> heartbeat issues.

Great!

> Nothing that explains why we would see the hang that
> you did, but other problems that caused the same 'wrongly marked me down'
> issue. Are you still seeing this problem with the latest 'next' and/or
> 'master' branch?

I've been trying to isolate this on the stable branch since my last
posting. I can still reproduce it at will with my 96 osd test, but I
haven't made much progress in tracking down what is going wrong.

>
> Also, if you don't mind reproducing, can you post a larger segment of the
> log?

Sure. I've got some extra debug printing going in my tree. The most
interesting is a patch to log queue, operation, and total elapsed times
in dispatch_entry(); it makes it really easy to find when things go
wrong.

I'll try to reproduce with master and post logs. Is it OK for me to
add my extra debug patches for that? I'll post them with the logs if
so.

> The really interesting question is what the heartbeat thread
> (heartbeat_entry()) is doing during this period that tick() is blocked up,
> since that's the thread that's responsible for sending the ping messages
> to peer OSDs.

One of the things I am seeing is handle_osd_ping() getting stalled,
but I haven't been able to track down why. I'll see if I see the same
signature with master, and post logs.

-- Jim

>
> Thanks!
> sage
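
For anyone following along, the kind of timing instrumentation mentioned above (queue, operation, and total elapsed times per dispatched message) could look roughly like the sketch below. This is not the actual patch and it does not use Ceph's messenger code: TimedMessage, DispatchTimes, compute_times() and timed_dispatch() are hypothetical names, and the sketch assumes each message is stamped with a monotonic time when it is enqueued.

```cpp
#include <chrono>
#include <cstdio>
#include <string>
#include <utility>

using Clock = std::chrono::steady_clock;

// Hypothetical message wrapper: stamped with the time it entered the queue.
struct TimedMessage {
  std::string type;
  Clock::time_point queued_at;
};

// The three intervals of interest, in seconds.
struct DispatchTimes {
  double queue_s;  // waiting in the dispatch queue
  double op_s;     // spent inside the handler
  double total_s;  // from enqueue to handler completion
};

inline double seconds_between(Clock::time_point a, Clock::time_point b) {
  return std::chrono::duration<double>(b - a).count();
}

// Pure helper so the arithmetic can be checked with fixed time points.
inline DispatchTimes compute_times(Clock::time_point queued,
                                   Clock::time_point start,
                                   Clock::time_point end) {
  return DispatchTimes{seconds_between(queued, start),
                       seconds_between(start, end),
                       seconds_between(queued, end)};
}

// Wrap a handler call with timing and log one line per message to stderr.
template <typename Handler>
DispatchTimes timed_dispatch(const TimedMessage& m, Handler&& handle) {
  const Clock::time_point start = Clock::now();
  std::forward<Handler>(handle)(m);
  const Clock::time_point end = Clock::now();

  const DispatchTimes t = compute_times(m.queued_at, start, end);
  std::fprintf(stderr, "dispatch %s: queue %.6f op %.6f total %.6f\n",
               m.type.c_str(), t.queue_s, t.op_s, t.total_s);
  return t;
}
```

With something like this in the dispatch loop, a large queue time with a small operation time would suggest the dispatch thread was busy or blocked elsewhere, while a large operation time points at the handler itself.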