Re: ECONNREFUSED implies OSD definitely failed

Lars Marowsky-Bree <lmb@xxxxxxxx> · Thu, 28 Apr 2016 16:32:51 +0200

On 2016-04-22T12:24:52, Sage Weil <sweil@xxxxxxxxxx> wrote:

> Piotr has a PR at
> 
> 	https://github.com/ceph/ceph/pull/8558
> 
> that changes the messenger and OSD logic so that if we get an ECONNREFUSED 
> trying to talk to another OSD we can definitively conclude that the OSD is 
> down/failed, without waiting for the normal heartbeat timeout.
> 
> I think this is true in normal networking environments.  My only concern 
> is that there might be cases where the OSD isn't actually down and some 
> transient network issue could cause ECONNREFUSED.  Like... some 
> firewally magic networky thing.  If a transient ECONNREFUSED was possible, 
> it could cause some ugly flapping.
> 
> Can anyone think of something that might cause this?  Even if it is 
> something obscure, it means we should have a config option to disable this 
> new behavior (we probably should anyway).

Exactly this - the system reconfiguring its network interfaces and
firewall rules (in a suboptimal fashion; it should drop, not reject, but
...).
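
To make the drop-versus-reject distinction concrete, here is a minimal
sketch of what a connecting peer observes. A closed port, or a firewall
rule that rejects with a TCP reset, shows up as ECONNREFUSED; a rule that
silently drops shows up as a timeout. The helper name probe() and its
return strings are made up for illustration and are not taken from the
Ceph messenger.

```python
import errno
import socket


def probe(host, port, timeout=1.0):
    """Classify the outcome of one TCP connect attempt (illustrative helper)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # ECONNREFUSED: nothing listening, or a firewall REJECT with a TCP reset.
        return "refused"
    except socket.timeout:
        # No answer at all: typical of a firewall DROP or an unreachable host.
        return "timeout"
    except OSError as e:
        # Anything else (EHOSTUNREACH, ENETUNREACH, ...), reported by errno name.
        return "error:" + errno.errorcode.get(e.errno, str(e.errno))
    finally:
        s.close()
```

Only the "refused" case is the one the proposed fast-fail logic would act
on, which is why a rejecting firewall during reconfiguration matters here.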

Or a duplicate IP address (with a node that isn't running ceph-osd).
Again, not supposed to happen.
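
Since both cases can produce a refused connection from a host whose OSD is
actually fine, the switch Sage mentions would let operators fall back to
the heartbeat timeout alone. A rough sketch of that decision, assuming a
hypothetical flag fast_fail_on_refused and an illustrative grace value
(neither name nor number is taken from the PR):

```python
import errno


def osd_considered_failed(last_errno, secs_since_heartbeat,
                          heartbeat_grace=20.0, fast_fail_on_refused=True):
    """Decide whether a peer OSD should be reported as failed (sketch only).

    fast_fail_on_refused is a hypothetical option name; heartbeat_grace is
    an illustrative value in seconds.
    """
    # New behaviour: a refused connection is treated as immediate failure.
    if fast_fail_on_refused and last_errno == errno.ECONNREFUSED:
        return True
    # Old behaviour: wait until the heartbeat grace period has expired.
    return secs_since_heartbeat > heartbeat_grace
```

With the flag off, a transient ECONNREFUSED from a misbehaving firewall or
a duplicate IP only matters if it outlasts the heartbeat grace period.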

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
