On Fri, 29 Apr 2016, Piotr Dałek wrote:
> On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> > On 2016-04-22T12:24:52, Sage Weil <sweil@xxxxxxxxxx> wrote:
> >
> > > Piotr has a PR at
> > >
> > >   https://github.com/ceph/ceph/pull/8558
> > >
> > > that changes the messenger and OSD logic so that if we get an ECONNREFUSED
> > > trying to talk to another OSD we can definitively conclude that the OSD is
> > > down/failed, without waiting for the normal heartbeat timeout.
> > >
> > > I think this is true in normal networking environments. My only concern
> > > is that there might be cases where the OSD isn't actually down and some
> > > transient network issue could cause ECONNREFUSED. Like... some
> > > firewally magic networky thing. If a transient ECONNREFUSED was possible,
> > > it could cause some ugly flapping.
> > >
> > > Can anyone think of something that might cause this? Even if it is
> > > something obscure, it means we should have a config option to disable this
> > > new behavior (we probably should anyway).
> >
> > Exactly this - the system reconfiguring its network interfaces and
> > firewall rules (in a suboptimal fashion; it should drop, not reject, but
> > ...).
>
> I'm not convinced that we should care about this. I think that the probability
> of (re)connect event occurrence during firewall reconfiguration is quite
> low.

Yeah, I tend to agree. Let's just add a config option to control the new
behavior so that if, for some reason, there is an environment where this
does happen the fast-fail can be disabled.

sage
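
For illustration only, here is a rough Python sketch of the decision being discussed: treat ECONNREFUSED as an immediate "peer is down" signal, but only when a switch allows it. This is not the code from the PR and not Ceph's messenger API. The names `classify_connect_error`, `probe_peer`, and `fast_fail_enabled` are hypothetical, and the real option name is whatever the PR ends up using.

```python
import errno
import socket


def classify_connect_error(err: OSError, fast_fail_enabled: bool) -> str:
    """Decide how to react to a failed connect() to a peer.

    Hypothetical helper: returns "mark_down" when the peer actively
    refused the connection and fast-fail is enabled; otherwise returns
    "wait_for_heartbeat" so the normal heartbeat timeout decides.
    """
    if fast_fail_enabled and err.errno == errno.ECONNREFUSED:
        return "mark_down"
    return "wait_for_heartbeat"


def probe_peer(host: str, port: int,
               fast_fail_enabled: bool = True,
               timeout: float = 1.0) -> str:
    """Try a TCP connect and classify the outcome.

    Timeouts and other socket errors never trigger the fast path,
    because they do not prove that nothing is listening on the port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except OSError as err:
        return classify_connect_error(err, fast_fail_enabled)
```

With `fast_fail_enabled=False` a refused connection falls back to the heartbeat path, which is the escape hatch for environments where a firewall rejects packets instead of dropping them during reconfiguration.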