ECONNREFUSED implies OSD definitely failed

Sage Weil <sweil@xxxxxxxxxx> · Fri, 22 Apr 2016 12:24:52 -0400 (EDT)

Piotr has a PR at

	https://github.com/ceph/ceph/pull/8558

that changes the messenger and OSD logic so that if we get an ECONNREFUSED 
trying to talk to another OSD we can definitively conclude that the OSD is 
down/failed, without waiting for the normal heartbeat timeout.
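
To make the signal concrete, here is a rough sketch (not the code from the PR) of how a single TCP connect attempt separates "refused" from other outcomes. The function name probe_peer and its return strings are made up for illustration; only the Python standard library is used.

```python
import socket


def probe_peer(host, port, timeout=1.0):
    """Classify one TCP connect attempt to a peer (hypothetical helper).

    Returns "up" if the connect succeeded, "refused" if the peer's host
    answered with a reset (ECONNREFUSED), and "unknown" for anything else
    (timeout, unreachable network, ...).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        try:
            s.connect((host, port))
            return "up"
        except ConnectionRefusedError:
            # The host is reachable but nothing is listening on the port.
            return "refused"
        except OSError:
            # Timeouts and other errors say nothing definite about the peer.
            return "unknown"
```

Only the "refused" case would justify skipping the heartbeat timeout; "unknown" still has to wait for it.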

I think this is true in normal networking environments.  My only concern 
is that there might be cases where the OSD isn't actually down and some 
transient network issue could cause ECONNREFUSED.  Like... some 
firewally magic networky thing.  If a transient ECONNREFUSED were possible, 
it could cause some ugly flapping (OSDs repeatedly marked down and then 
back up).

Can anyone think of something that might cause this?  Even if it is 
something obscure, it means we should have a config option to disable this 
new behavior (we probably should anyway).
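
As a sketch of what such a switch might look like (the option name fast_fail_on_refused and the action strings are hypothetical, not taken from the PR), the decision could be gated roughly like this:

```python
import errno


def peer_failure_action(err, fast_fail_on_refused=True):
    """Decide how to react to a failed connect to a peer OSD.

    err is the errno from the connect attempt; fast_fail_on_refused is a
    hypothetical config option that turns the new behavior on or off.
    """
    if fast_fail_on_refused and err == errno.ECONNREFUSED:
        # Treat the refusal as proof that the peer is down.
        return "report_failed_now"
    # Old behavior: let the normal heartbeat timeout decide.
    return "wait_for_heartbeat_timeout"
```

With the option off, an ECONNREFUSED is handled like any other connect error, so a site with odd networking could fall back to the heartbeat-only logic.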

sage