Re: Full cluster outage when ECONNREFUSED is triggered

> On 24 Nov 2023, at 11:49, Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> 
> This should not be the case in the reported situation unless setting osd_fast_fail_on_connection_refused=true (https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused) changes this behaviour.


In our tests it does change the behavior. Normally the mons take mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account before marking an OSD down. In our tests, that is what happens when an OSD heartbeat is dropped and the OSD is still able to talk to the mons.
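
To make the normal path concrete, here is a minimal Python sketch of the reporter counting as I understand it. It is a simplified model, not the Ceph code: the Report type, the host names and should_mark_down() are made up for illustration, the subtree level is assumed to be "host", min_down_reporters is assumed to be 2, and the heartbeat grace time that the mons also check is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    reporter: str       # reporting OSD, e.g. "osd.1" (hypothetical)
    reporter_host: str  # CRUSH bucket of the reporter at the subtree level
    target: str         # OSD that is being reported as failed


def should_mark_down(reports, target, min_down_reporters=2):
    """True once enough distinct subtrees have reported `target` as failed."""
    subtrees = {r.reporter_host for r in reports if r.target == target}
    return len(subtrees) >= min_down_reporters


reports = [
    Report("osd.1", "host-a", "osd.7"),
    Report("osd.2", "host-a", "osd.7"),  # same host, so it counts only once
]
print(should_mark_down(reports, "osd.7"))  # False: only one subtree reported

reports.append(Report("osd.4", "host-b", "osd.7"))
print(should_mark_down(reports, "osd.7"))  # True: two distinct subtrees
```

Two OSDs on the same host count as one reporter here, so a single misbehaving host cannot mark a peer down on its own.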

However, if the OSD heartbeat is rejected, in our case because of an unrelated firewall change, the OSD sends an immediate failure to the mon:
https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434


The mon then propagates that failure, without taking any other reports into consideration:

https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3367

This is fine when a single OSD goes down and everything else is okay. It then has the intended effect of getting rid of the OSD fast. The assumption presumably being: If a host can answer with a rejection to the OSD heartbeat, it is only the OSD that is affected.

In our case, however, a network change caused rejections from an entirely different host (a gateway), while a network path to the mons was still available. In this case, Ceph does not apply the safeguards it usually does.
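
As a rough illustration of why this takes out the whole cluster, here is a small Python model of the two paths described in this mail. It is only a sketch of the behavior as we observed it, not the actual mon logic: osds_marked_down(), the tuple layout, the host and OSD names, and the threshold of 2 reporters are all assumptions made for the example.

```python
def osds_marked_down(reports, min_down_reporters=2):
    """Return the set of OSDs the (modeled) mon would mark down.

    reports: iterable of (reporter_host, target_osd, immediate) tuples.
    An immediate report models a refused heartbeat (ECONNREFUSED); a
    non-immediate report models an ordinary heartbeat timeout.
    """
    reporters_by_target = {}
    down = set()
    for reporter_host, target, immediate in reports:
        if immediate:
            # Fast-fail path: a single report is enough, no reporter counting.
            down.add(target)
            continue
        # Normal path: wait for reports from enough distinct hosts.
        reporters_by_target.setdefault(target, set()).add(reporter_host)
        if len(reporters_by_target[target]) >= min_down_reporters:
            down.add(target)
    return down


peers = [f"osd.{i}" for i in range(1, 6)]

# One host loses its peers through timeouts: nothing is marked down.
timeouts = [("host-a", peer, False) for peer in peers]
print(sorted(osds_marked_down(timeouts)))  # []

# The same host gets rejections from a gateway: every peer is marked down.
refused = [("host-a", peer, True) for peer in peers]
print(sorted(osds_marked_down(refused)))
# ['osd.1', 'osd.2', 'osd.3', 'osd.4', 'osd.5']
```

In the model, one reporter that sees rejections for all of its peers is enough to mark all of them down, which matches what we saw.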



