Re: ECONNREFUSED implies OSD definitely failed

Sage Weil <sage@xxxxxxxxxxxx> · Fri, 29 Apr 2016 08:29:59 -0400 (EDT)

On Fri, 29 Apr 2016, Piotr Dałek wrote:
> On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> > On 2016-04-22T12:24:52, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > 
> > > Piotr has a PR at
> > > 
> > > 	https://github.com/ceph/ceph/pull/8558
> > > 
> > > that changes the messenger and OSD logic so that if we get an ECONNREFUSED 
> > > trying to talk to another OSD we can definitively conclude that the OSD is 
> > > down/failed, without waiting for the normal heartbeat timeout.
> > > 
> > > I think this is true in normal networking environments.  My only concern 
> > > is that there might be cases where the OSD isn't actually down and some 
> > > transient network issue could cause ECONNREFUSED.  Like... some 
> > > firewally magic networky thing.  If a transient ECONNREFUSED was possible, 
> > > it could cause some ugly flapping.
> > > 
> > > Can anyone think of something that might cause this?  Even if it is 
> > > something obscure, it means we should have a config option to disable this 
> > > new behavior (we probably should anyway).
> > 
> > Exactly this - the system reconfiguring its network interfaces and
> > firewall rules (in a suboptimal fashion; it should drop, not reject, but
> > ...).
> 
> I'm not convinced that we should care about this. I think the probability
> of a (re)connect happening during firewall reconfiguration is quite
> low.

Yeah, I tend to agree.

Let's just add a config option to control the new behavior so that if, for 
some reason, there is an environment where this does happen, the fast-fail 
can be disabled.
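
To illustrate the idea only, here is a rough sketch in Python rather than the actual messenger code. The option name FAST_FAIL_ON_CONNECTION_REFUSED and the probe_peer() helper are made up for this example and are not taken from the PR. With the option on, a refused connection is treated as a definite "down"; with it off, the caller would fall back to the normal heartbeat timeout.

```python
import socket

# Hypothetical option name for illustration; not the name used in the PR.
FAST_FAIL_ON_CONNECTION_REFUSED = True


def probe_peer(host, port, timeout=1.0,
               fast_fail=FAST_FAIL_ON_CONNECTION_REFUSED):
    """Classify a TCP peer as 'up', 'down', or 'unknown'.

    'unknown' means: draw no conclusion here and let the regular
    heartbeat timeout decide.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except ConnectionRefusedError:
        # ECONNREFUSED: nothing is listening on that port (or a firewall
        # rejected the connection). Only trust it when fast-fail is enabled.
        return "down" if fast_fail else "unknown"
    except OSError:
        # Timeouts, unreachable hosts, etc. are never treated as definitive.
        return "unknown"
```

An environment whose firewall rejects instead of drops during reconfiguration would set the option to False, so that a transient ECONNREFUSED could not mark a healthy peer down.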

sage