Re: Full cluster outage when ECONNREFUSED is triggered

Hi,


I think this is why the mon-osd interaction requires a certain number
of OSDs to report another OSD as down/unavailable:

https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/#osds-report-down-osds


The default value for mon_osd_reporter_subtree_level is host, and the
default value for mon_osd_min_down_reporters is 2 (values taken from
one of our clusters, with no overrides in ceph config). So it requires
reports from two OSDs on different hosts before another OSD is
considered down. This should not be the case in the reported situation,
unless setting osd_fast_fail_on_connection_refused=true
(https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused)
changes this behaviour.
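
For reference, the effective values on a given cluster can be checked
with the standard "ceph config" CLI; a minimal sketch (the defaults
noted in the comments are the documented ones, not values read from
the poster's cluster):

    # effective settings on the monitors
    ceph config get mon mon_osd_reporter_subtree_level    # default: host
    ceph config get mon mon_osd_min_down_reporters        # default: 2
    # the OSD-side fast-fail option discussed in this thread
    ceph config get osd osd_fast_fail_on_connection_refused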


Best regards,

Burkhard Linke

On 24.11.23 11:09, Janne Johansson wrote:
On Fri, 24 Nov 2023 at 10:25, Frank Schilder <frans@xxxxxx> wrote:
Hi Denis,

I would agree with you that a single misconfigured host should not
take out healthy hosts under any circumstances. I'm not sure your
incident is actually covered by the devs' comments; it is quite
possible that you observed an unintended side effect, i.e. a bug in
the handling of the connection error. I think the intention is to
quickly shut down the OSDs that refuse connections (where timeouts
are not required), not other OSDs.
No, this has been true for a long while and has happened to me
multiple times. Any kind of fault where an OSD can talk to at least
one mon, but then for any of several reasons does not respond or
connect to other OSDs, leads to this. The OSD comes up, may hold zero
data and have a lifetime of five seconds, and then it claims it can't
talk to some 5-10-15 other OSDs and rats on them to the mon, which
then proceeds to believe this new OSD and flaps those other OSDs,
which then get to reconnect while the mon logs "osdmap says I am down
but I am not". This goes on until you kill the bad OSD or fix the
network. If you boot up a host with many OSDs, they all pick 5-10-15
OSDs to shoot down, so you quickly get annoying errors at a large
scale if you ever misconfigure a new host in even the smallest way.
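
(An aside, not from the original mails: while such a flapping storm
is ongoing, one common stopgap is the stock "nodown" cluster flag,
which tells the mons to ignore down reports while you hunt for the
misbehaving host. A sketch, to be reverted once fixed:)

    # stop the mons from marking OSDs down on reports
    ceph osd set nodown
    # ... locate and stop the bad OSD / fix the host's network ...
    # re-enable normal failure detection
    ceph osd unset nodown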

There should be some better kind of validation that the new (and in
my case at least, often empty) OSD is not itself at fault, and that
the hundreds of other working OSDs are in fact not gone at all,
before causing this kind of confusion.
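
(Until such validation exists, one defensive knob worth considering,
my suggestion rather than something proposed in the thread, is to
require down reports from more independent failure domains before the
mons act on them, e.g.:

    # require down reports from OSDs in three different hosts
    # (mon_osd_reporter_subtree_level defaults to "host")
    ceph config set mon mon_osd_min_down_reporters 3

Whether this also guards against the
osd_fast_fail_on_connection_refused path is exactly the open question
raised above.)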

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


