On Wednesday, February 23, 2011 at 10:54 AM, Sage Weil wrote:
> On Wed, 23 Feb 2011, Gregory Farnum wrote:
> > I have managed to get OSDs wrongly marking each other down during
> > startup when they're peering large numbers of PGs/pools, as they
> > disagree on who they need to be heartbeating (due to the slow handling
> > of new osd maps and pg creates); if you're mostly seeing OSDs get
> > incorrectly marked down during low epochs (your original email said
> > epoch 7) this is probably what you're finding.
>
> FWIW, this isn't supposed to happen either.. the implementation may be
> broken somewhat. The idea is that once an OSD starts to expect a
> heartbeat it should tell them so. And if an OSD is told that a future
> epoch says it should send heartbeats to node foo, then it will do so, at
> least until it processes that epoch.

Hmmm -- I don't think they're telling the other OSDs that they're heartbeat partners! At least I didn't see anything that would make that happen. They just start expecting pings, and in some cases they will start sending them because they notice they're a local replica too, but there's nothing in those messages like "you owe me pings as of epoch x". Are there stubs you know of that I should look at in re-implementing this behavior?

> > We still have no idea what could be causing the stall *inside* of
> > tick(), though. :/
>
> You mean heartbeat(), right? Yep, still no clue... :(

Well the 28-second stall is inside of tick() as it arms a timer for the next tick. Heartbeat is definitely failing but nobody's quite sure why, as I recall.
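
For reference, here is a rough sketch of the behavior Sage describes above: an OSD that starts expecting pings tells its peer "you owe me pings as of epoch x", and the peer honors that request at least until it has processed that epoch itself. This is only a toy model in Python, not Ceph's code; the `Osd` class and every method and field name in it are made up for illustration, and real OSDs would exchange these requests as messages rather than direct calls.

```python
from dataclasses import dataclass, field


@dataclass
class Osd:
    """Toy OSD; all names are hypothetical, not Ceph's real API."""
    osd_id: int
    epoch: int = 0  # newest osdmap epoch this OSD has processed
    map_peers: set = field(default_factory=set)  # partners per its own map
    # peer id -> newest epoch at which that peer asked us for pings
    requested: dict = field(default_factory=dict)

    def request_heartbeats(self, peer: "Osd") -> None:
        """Start expecting pings from `peer` and tell it so explicitly."""
        self.map_peers.add(peer.osd_id)
        peer.handle_request(self.osd_id, self.epoch)

    def handle_request(self, from_id: int, as_of_epoch: int) -> None:
        """Record a 'you owe me pings as of epoch x' request."""
        prev = self.requested.get(from_id, 0)
        self.requested[from_id] = max(prev, as_of_epoch)

    def process_map(self, epoch: int, peers: set) -> None:
        """Catch up to `epoch`; after that, our own map decides for
        requests made at or before that epoch, so they are dropped."""
        self.epoch = epoch
        self.map_peers = set(peers)
        self.requested = {p: e for p, e in self.requested.items() if e > epoch}

    def ping_targets(self) -> set:
        """Peers from our own map plus anyone who asked from a future epoch."""
        return self.map_peers | set(self.requested)


# osd0 is already at epoch 7; osd1 is still working through epoch 5.
osd0, osd1 = Osd(0, epoch=7), Osd(1, epoch=5)
osd0.request_heartbeats(osd1)
targets_while_behind = osd1.ping_targets()  # osd1 pings osd0 despite its old map
```

The point of the explicit request is that the slow OSD does not need to have caught up on maps to know who is waiting on it, which is the disagreement that leads to the wrong mark-downs during startup.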