Re: ECONNREFUSED implies OSD definitely failed

Piotr Dałek <branch@xxxxxxxxxxxxxxxx> · Fri, 29 Apr 2016 09:46:39 +0200

On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> On 2016-04-22T12:24:52, Sage Weil <sweil@xxxxxxxxxx> wrote:
> 
> > Piotr has a PR at
> > 
> > 	https://github.com/ceph/ceph/pull/8558
> > 
> > that changes the messenger and OSD logic so that if we get an ECONNREFUSED 
> > trying to talk to another OSD we can definitively conclude that the OSD is 
> > down/failed, without waiting for the normal heartbeat timeout.
> > 
> > I think this is true in normal networking environments.  My only concern 
> > is that there might be cases where the OSD isn't actually down and some 
> > transient network issue could cause ECONNREFUSED.  Like... some 
> > firewally magic networky thing.  If a transient ECONNREFUSED was possible, 
> > it could cause some ugly flapping.
> > 
> > Can anyone think of something that might cause this?  Even if it is 
> > something obscure, it means we should have a config option to disable this 
> > new behavior (we probably should anyway).
> 
> Exactly this - the system reconfiguring its network interfaces and
> firewall rules (in a suboptimal fashion; it should drop, not reject, but
> ...).

I'm not convinced that we should care about this. I think the probability
of a (re)connect attempt landing in the middle of a firewall
reconfiguration is quite low.
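
To make the reject-vs-drop distinction concrete, here is a rough Python
sketch. It is not the code from the PR; `probe_peer` and
`refused_means_down` are made-up names standing in for the messenger logic
and for the kind of config option Sage suggested. A refused connect fails
immediately with ECONNREFUSED, while a dropped one only shows up as a
timeout, so only the former can be treated as a quick "down" signal:

```python
import socket


def probe_peer(host, port, timeout=1.0, refused_means_down=True):
    """Classify a peer as 'up', 'down' or 'unknown' from one TCP connect.

    Hypothetical helper for illustration only; the names and the return
    values are assumptions, not Ceph's API.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except ConnectionRefusedError:
        # ECONNREFUSED: something answered and actively refused the
        # connection (no listener on the port, or a rejecting firewall).
        # With the option off we fall back to "unknown" and would wait
        # for the normal heartbeat timeout instead.
        return "down" if refused_means_down else "unknown"
    except OSError:
        # Timeouts, unreachable host/network, etc. tell us nothing
        # definite about the peer process itself.
        return "unknown"
```

Calling `probe_peer("127.0.0.1", port)` against a port with no listener
should come back as "down" right away, and passing
`refused_means_down=False` turns the same situation into "unknown", which
is the behaviour the opt-out would restore.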

> Or a duplicate IP address (with a node that isn't running ceph-osd).
> Again, not supposed to happen.

That will cause a lot of other things to fail anyway, and having ceph-osd
marked down faster gives a greater chance of getting someone's attention.

-- 
Piotr Dałek
branch@xxxxxxxxxxxxxxxx
http://blog.predictor.org.pl