Piotr has a PR at https://github.com/ceph/ceph/pull/8558 that changes the messenger and OSD logic so that if we get an ECONNREFUSED while trying to talk to another OSD, we can definitively conclude that the OSD is down/failed without waiting for the normal heartbeat timeout.

I think this is true in normal networking environments. My only concern is that there might be cases where the OSD isn't actually down and some transient network issue causes ECONNREFUSED. Like... some firewally magic networky thing. If a transient ECONNREFUSED were possible, it could cause some ugly flapping.

Can anyone think of something that might cause this? Even if it is something obscure, it means we should have a config option to disable this new behavior (we probably should anyway).

sage
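
To make the question concrete, here is a minimal Python sketch of the behavior under discussion. It is not code from the PR: the function name `probe_peer` and the option name `fast_fail_on_refused` are hypothetical, chosen only for illustration. It treats ECONNREFUSED as a definitive "down" only when the option is enabled, and otherwise leaves the decision to the normal heartbeat timeout.

```python
import socket


def probe_peer(host, port, fast_fail_on_refused=True, timeout=1.0):
    """Classify a single connection attempt to a peer.

    Returns "up" if the connection succeeds, "down" if it is refused
    and fast-fail is enabled, and "unknown" when the result should be
    left to the normal heartbeat timeout.
    `fast_fail_on_refused` is a hypothetical option name.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except ConnectionRefusedError:
        # ECONNREFUSED: normally nothing is listening on that port.
        # A firewall that rejects with a TCP reset would look the same
        # to the caller, which is why the option to disable this exists.
        return "down" if fast_fail_on_refused else "unknown"
    except OSError:
        # Timeouts, unreachable hosts, etc. are not definitive.
        return "unknown"
```

With the option disabled, a refused connection is reported as "unknown", so a transient refusal could not by itself mark a peer down.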