On 2016-04-22T12:24:52, Sage Weil <sweil@xxxxxxxxxx> wrote:

> Piotr has a PR at
>
>   https://github.com/ceph/ceph/pull/8558
>
> that changes the messenger and OSD logic so that if we get an ECONNREFUSED
> trying to talk to another OSD we can definitively conclude that the OSD is
> down/failed, without waiting for the normal heartbeat timeout.
>
> I think this is true in normal networking environments. My only concern
> is that there might be cases where the OSD isn't actually down and some
> transient network issue could cause ECONNREFUSED. Like... some
> firewally magic networky thing. If a transient ECONNREFUSED was possible,
> it could cause some ugly flapping.
>
> Can anyone think of something that might cause this? Even if it is
> something obscure, it means we should have a config option to disable this
> new behavior (we probably should anyway).

Exactly this - the system reconfiguring its network interfaces and
firewall rules (in a suboptimal fashion; it should drop, not reject,
but ...).

Or a duplicate IP address (with a node that isn't running ceph-osd).
Again, not supposed to happen.

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
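
To make the trade-off concrete, here is a rough sketch of the behavior being discussed, written in Python rather than taken from the PR. The names classify_peer and fast_fail_on_refused are hypothetical and are not Ceph's API or config options; the sketch only assumes that a refused TCP connect surfaces as ECONNREFUSED (ConnectionRefusedError in Python). With the switch on, a refusal is treated as "down" immediately; with it off, a refusal is treated like any other error and the decision is left to the normal heartbeat timeout.

```python
import socket


def classify_peer(host, port, timeout=1.0, fast_fail_on_refused=True):
    """Return 'up', 'down', or 'unknown' for a peer address.

    Hypothetical helper for illustration only; not Ceph code.
    """
    try:
        # A successful TCP handshake means something is listening there.
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except ConnectionRefusedError:
        # ECONNREFUSED: the address answered, but nothing accepts on the
        # port. A firewall REJECT rule or a duplicate IP on a host without
        # the daemon would look exactly the same, hence the opt-out switch.
        return "down" if fast_fail_on_refused else "unknown"
    except OSError:
        # Timeouts, unreachable hosts, etc.: no conclusion possible here,
        # so the caller falls back to its heartbeat timeout.
        return "unknown"


if __name__ == "__main__":
    # Assumed address and port, purely for demonstration.
    print(classify_peer("127.0.0.1", 6800))
```

Note that a firewall that drops packets produces a timeout ("unknown" above), while one that rejects produces the refusal, which is why the drop-versus-reject distinction matters for this change.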