On Wednesday, February 23, 2011 at 11:23 AM, Jim Schutt wrote:
> > I have managed to get OSDs wrongly marking each other down during
> > startup when they're peering large numbers of PGs/pools, as they
> > disagree on who they need to be heartbeating (due to the slow
> > handling of new osd maps and pg creates); if you're mostly seeing
> > OSDs get incorrectly marked down during low epochs (your original
> > email said epoch 7) this is probably what you're finding.
>
> What I've been trying to look for is heartbeat stalls after I
> start up a bunch of clients writing. I'm really not sure why that
> original log caught one at such an early epoch - maybe there's
> two things going on?

That wouldn't surprise me too much, but it's something to keep in mind when observing. :)

> > We still have no idea what could be causing the stall *inside* of
> > tick(), though. :/
>
> I think that one was just lucky. Most of the stalls I've
> collected are between ticks.

Stalls between ticks make a lot of sense, since tick() requires the osd_lock and we have some functions holding it for way too long. But as far as we can tell, a stalled tick() shouldn't break anything: heartbeats are sent independently, and all the processing of heartbeats (where you detect down OSDs) is done inside of tick() in such a way that it's not going to lose delivery of heartbeats. So that shouldn't be a problem!
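
To make the reasoning about a stalled tick() concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not Ceph code: the class `ToyOSD`, the helper `simulate`, the peer names, and the grace value are all hypothetical, and the real OSD logic is considerably more involved. It assumes heartbeat arrivals are timestamped on a path that does not need the osd_lock, and that tick() judges peers by those timestamps when it finally runs.

```python
class ToyOSD:
    """Toy model, not Ceph code: heartbeat receipt is decoupled from tick()."""

    def __init__(self, peers, grace):
        self.grace = grace                    # hypothetical grace period, in time units
        self.last_rx = {p: 0 for p in peers}  # last heartbeat time seen per peer

    def on_heartbeat(self, peer, now):
        # Heartbeat path: assumed not to need the (modelled) osd_lock, so it
        # keeps recording arrivals even while tick() is delayed.
        self.last_rx[peer] = now

    def tick(self, now):
        # Runs only when the (modelled) osd_lock is available. Peers are judged
        # by their recorded timestamps, not by how long ago tick() last ran.
        return sorted(p for p, t in self.last_rx.items() if now - t > self.grace)


def simulate(stall_from, stall_until, silent_peer=None, silent_after=None, end=30):
    """Return {time: peers reported down} for every time step where tick() ran."""
    osd = ToyOSD(peers=["osd.1", "osd.2"], grace=5)
    reports = {}
    for now in range(1, end + 1):
        for peer in osd.last_rx:
            if peer == silent_peer and now > silent_after:
                continue  # this peer really stopped sending heartbeats
            osd.on_heartbeat(peer, now)
        if stall_from <= now < stall_until:
            continue  # tick() cannot run: something else holds the osd_lock
        reports[now] = osd.tick(now)
    return reports


if __name__ == "__main__":
    # Healthy peers, tick() stalled for 10 time units: nobody is reported down.
    print(simulate(stall_from=10, stall_until=20)[20])
    # A peer that really went silent is still caught, only later than usual.
    print(simulate(10, 20, silent_peer="osd.2", silent_after=8)[20])
```

In this model a stall only delays detection of a peer that is really down; it never produces a false report for a healthy peer. If OSDs are being wrongly marked down in practice, that would suggest one of the model's assumptions does not hold, for example that heartbeat sending or receipt is itself being delayed.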