Re: cosd multi-second stalls cause "wrongly marked me down"

Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx> · Wed, 23 Feb 2011 12:27:17 -0800



On Wednesday, February 23, 2011 at 11:23 AM, Jim Schutt wrote:
> > I have managed to get OSDs wrongly marking each other down during startup when they're peering large numbers of PGs/pools, as they disagree on who they need to be heartbeating (due to the slow handling of new osd maps and pg creates); if you're mostly seeing OSDs get incorrectly marked down during low epochs (your original email said epoch 7) this is probably what you're finding. 
> 
> What I've been trying to look for is heartbeat stalls after I 
> start up a bunch of clients writing. I'm really not sure why that
> original log caught one at such an early epoch - maybe there's
> two things going on?
> 
That wouldn't surprise me too much, but it's something to keep in mind when observing. :)

> > We still have no idea what could be causing the stall *inside* of tick(), though. :/
> 
> I think that one was just lucky. Most of the stalls I've
> collected are between ticks.
Stalls between ticks make a lot of sense, since tick() requires the osd_lock and we have some functions that hold it for far too long. As far as we can tell, though, a stalled tick() shouldn't break anything: heartbeats are sent independently, and all the processing of heartbeats (where down OSDs are detected) is done inside tick() in such a way that no delivered heartbeat gets lost. So that shouldn't be a problem!
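
To make that reasoning concrete, here is a toy model in Python. It is not Ceph code: the class, the method names, the separate heartbeat_lock, and the 20-second grace value are all assumptions made up for illustration. It only sketches the claim that if heartbeat receipt merely records a timestamp without taking osd_lock, then a late tick() still sees the freshest timestamps and does not flag a peer that kept sending.

```python
import threading


class ToyOSD:
    """Toy model of the heartbeat/tick split. All names are hypothetical."""

    def __init__(self, peers, grace):
        self.osd_lock = threading.Lock()        # guards tick() and other slow work
        self.heartbeat_lock = threading.Lock()  # assumed: guards only the timestamps
        self.grace = grace
        self.last_seen = {peer: 0.0 for peer in peers}

    def handle_heartbeat(self, peer, now):
        # Heartbeat path: never takes osd_lock, so a stalled tick() or a
        # long osd_lock holder cannot delay this update.
        with self.heartbeat_lock:
            self.last_seen[peer] = max(self.last_seen[peer], now)

    def tick(self, now):
        # Runs under osd_lock. It may start late, but it reads the most
        # recent timestamps and reports peers silent for longer than grace.
        with self.osd_lock:
            with self.heartbeat_lock:
                snapshot = dict(self.last_seen)
            return sorted(p for p, t in snapshot.items() if now - t > self.grace)


osd = ToyOSD(peers=["osd.1", "osd.2"], grace=20.0)

# Pretend tick() is stalled from t=0 to t=30. Meanwhile osd.1 keeps
# sending heartbeats and osd.2 stays silent.
for t in (5.0, 10.0, 15.0, 20.0, 25.0):
    osd.handle_heartbeat("osd.1", now=t)

failed = osd.tick(now=30.0)
print(failed)  # ['osd.2']
```

In this model only the silent peer is reported, even though tick() ran 30 seconds late. If the real receive path did block on osd_lock, the timestamps themselves would go stale and the conclusion would not hold, which is the kind of thing worth checking in the logs.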

