On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> On 2016-04-22T12:24:52, Sage Weil <sweil@xxxxxxxxxx> wrote:
>
> > Piotr has a PR at
> >
> > 	https://github.com/ceph/ceph/pull/8558
> >
> > that changes the messenger and OSD logic so that if we get an ECONNREFUSED
> > trying to talk to another OSD we can definitively conclude that the OSD is
> > down/failed, without waiting for the normal heartbeat timeout.
> >
> > I think this is true in normal networking environments. My only concern
> > is that there might be cases where the OSD isn't actually down and some
> > transient network issue could cause ECONNREFUSED. Like... some
> > firewally magic networky thing. If a transient ECONNREFUSED was possible,
> > it could cause some ugly flapping.
> >
> > Can anyone think of something that might cause this? Even if it is
> > something obscure, it means we should have a config option to disable this
> > new behavior (we probably should anyway).
>
> Exactly this - the system reconfiguring its network interfaces and
> firewall rules (in a suboptimal fashion; it should drop, not reject, but
> ...).

I'm not convinced that we should care about this. I think the probability
of a (re)connect attempt happening during a firewall reconfiguration is
quite low.

> Or a duplicate IP address (with a node that isn't running ceph-osd).
> Again, not supposed to happen.

That will cause a lot of other things to fail as well, and having ceph-osd
marked down faster gives a better chance of getting someone's attention.

-- 
Piotr Dałek
branch@xxxxxxxxxxxxxxxx
http://blog.predictor.org.pl