Re: ECONNREFUSED implies OSD definitely failed

On 2016-04-22T12:24:52, Sage Weil <sweil@xxxxxxxxxx> wrote:

> Piotr has a PR at
> 
> 	https://github.com/ceph/ceph/pull/8558
> 
> that changes the messenger and OSD logic so that if we get an ECONNREFUSED 
> trying to talk to another OSD we can definitively conclude that the OSD is 
> down/failed, without waiting for the normal heartbeat timeout.
> 
> I think this is true in normal networking environments.  My only concern 
> is that there might be cases where the OSD isn't actually down and some 
> transient network issue could cause ECONNREFUSED.  Like... some 
> firewally magic networky thing.  If a transient ECONNREFUSED was possible, 
> it could cause some ugly flapping.
> 
> Can anyone think of something that might cause this?  Even if it is 
> something obscure, it means we should have a config option to disable this 
> new behavior (we probably should anyway).

Exactly this - the system reconfiguring its network interfaces and
firewall rules (in a suboptimal fashion; the firewall should drop, not
reject, but ...).

Or a duplicate IP address (with a node that isn't running ceph-osd).
Again, not supposed to happen.
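
To make the flapping concern concrete, here is a rough Python sketch (not
the messenger code, and not what the PR does) of one possible mitigation:
only treat a peer as failed after several consecutive refusals, so a
single transient ECONNREFUSED from a firewall reload or a stray host on a
duplicate IP does not mark the OSD down. The names probe, RefusalDebouncer
and threshold are made up for illustration, and the default of 3 is an
arbitrary assumption.

```python
import errno
import socket


def probe(host, port, timeout=1.0):
    """Classify one TCP connect attempt as 'open', 'refused' or 'unreachable'.

    Hypothetical helper for illustration; not part of Ceph.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        # connect_ex returns an errno value instead of raising on most failures.
        err = s.connect_ex((host, port))
    except OSError:
        # e.g. name resolution failure
        return "unreachable"
    finally:
        s.close()
    if err == 0:
        return "open"
    if err == errno.ECONNREFUSED:
        # Something at that address actively rejected the connection. That may
        # be the peer's kernel, a rejecting firewall, or a different host
        # holding the same IP.
        return "refused"
    return "unreachable"


class RefusalDebouncer:
    """Report a peer as failed only after `threshold` consecutive refusals."""

    def __init__(self, threshold=3):  # 3 is an arbitrary, assumed default
        self.threshold = threshold
        self.consecutive = 0

    def observe(self, result):
        # Any result other than 'refused' resets the streak.
        if result == "refused":
            self.consecutive += 1
        else:
            self.consecutive = 0
        return self.consecutive >= self.threshold
```

Feeding each probe() result into observe() gives a "failed" verdict only
once the refusals persist, which trades a little detection latency for
less flapping. A config option to turn the fast path off entirely would
still be needed for environments where even that is not safe.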



-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
