Re: Full cluster outage when ECONNREFUSED is triggered

Janne Johansson <icepic.dz@xxxxxxxxx> · Fri, 24 Nov 2023 11:09:30 +0100

On Fri, 24 Nov 2023 at 10:25, Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Denis,
>
> I would agree with you that a single misconfigured host should not take out healthy hosts under any circumstances. I'm not sure your incident is actually covered by the devs' comments; it is quite possible that you observed an unintended side effect that is a bug in the handling of the connection error. I think the intention is to quickly shut down the OSDs that get connection refused (where timeouts are not required), and not other OSDs.

No, this has been true for a long while and has happened to me
multiple times. Any kind of fault where an OSD can talk to at least
one mon, but for any of several reasons does not respond to or
connect to other OSDs, leads to this. The OSD comes up, may hold 0
data and have been alive for 5 seconds, and then it claims it can't
talk to some 5-10-15 other OSDs and rats on them to the mon, which
then proceeds to believe this new OSD and flaps those other OSDs.
They then get to reconnect, and the mon logs that "osdmap says I am
down but I am not". This goes on until you kill the bad OSD or fix
the network. If you boot up a host with many OSDs, they each pick
5-10-15 OSDs to shoot down, so you quickly get annoying errors at a
large scale if you ever misconfigure a new host in even the slightest
way.

There should be some better kind of validation that the new (and in my
case at least, often empty) OSD is not at fault and that the other
hundreds of working OSDs are in fact not gone at all before causing
this kind of confusion.
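
To make the idea concrete, here is a minimal sketch of such a check.
It is not Ceph's actual logic; the FailureReport type, the
osds_to_mark_down function and both thresholds are hypothetical names
and values made up for illustration. The assumption is simply that a
report only counts if the reporter has been up for a while, and that
a target is only marked down when reporters on several distinct hosts
agree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReport:
    """One "I cannot reach that OSD" claim (hypothetical structure)."""
    reporter: int           # id of the OSD making the claim
    reporter_host: str      # host the reporting OSD runs on
    reporter_uptime: float  # seconds since the reporting OSD booted
    target: int             # id of the OSD claimed to be down


def osds_to_mark_down(reports, min_reporter_hosts=2, min_reporter_uptime=600.0):
    """Return the ids of OSDs with enough credible failure reports.

    Both thresholds are illustrative assumptions, not Ceph defaults.
    """
    hosts_per_target = {}
    for r in reports:
        if r.reporter_uptime < min_reporter_uptime:
            continue  # ignore claims from OSDs that only just booted
        hosts_per_target.setdefault(r.target, set()).add(r.reporter_host)
    # Require agreement from reporters on several distinct hosts.
    return {
        target
        for target, hosts in hosts_per_target.items()
        if len(hosts) >= min_reporter_hosts
    }


# A freshly booted, misconfigured host: three new OSDs each accuse
# three healthy OSDs after five seconds of uptime.
reports_from_new_host = [
    FailureReport(reporter=100 + i, reporter_host="newhost",
                  reporter_uptime=5.0, target=t)
    for i in range(3)
    for t in (1, 2, 3)
]

# A genuine failure: long-running OSDs on two different hosts both
# report that they cannot reach osd 7.
reports_genuine = [
    FailureReport(10, "hostA", 86400.0, 7),
    FailureReport(20, "hostB", 86400.0, 7),
]

print(osds_to_mark_down(reports_from_new_host))
print(osds_to_mark_down(reports_from_new_host + reports_genuine))
```

With these made-up inputs the accusations from the new host are
dropped, while the failure reported from two established hosts still
goes through. A real implementation would also have to decide how a
brand-new cluster, where every OSD is young, ever marks anything down.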

-- 
May the most significant bit of your life be positive.