On Fri, 24 Nov 2023 at 10:25, Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Denis,
>
> I would agree with you that a single misconfigured host should not take out healthy hosts under any circumstances. I'm not sure if your incident is actually covered by the devs' comments; it is quite possible that you observed an unintended side effect that is a bug in the handling of the connection error. I think the intention is to quickly shut down the OSDs with connection refused (where timeouts are not required) and not other OSDs.

No, this has been true for a long while and has happened to me multiple times. Any kind of fault where an OSD can talk to at least one mon, but for any of several reasons does not respond or connect to other OSDs, leads to this.

The OSD comes up, possibly holding no data and with a lifetime of 5 seconds, and then claims it can't talk to some 5, 10 or 15 other OSDs. It rats on them to the mon, which then proceeds to believe this new OSD and flaps those other OSDs. They then have to reconnect, and the mon logs messages to the effect of "osdmap says I am down but I am not". This goes on until you kill the bad OSD or fix the network.

If you boot up a host with many OSDs, each of them picks 5, 10 or 15 OSDs to shoot down, so you quickly get annoying errors at a large scale if you ever misconfigure a new host in even the slightest way.

There should be some better kind of validation that the new (and in my case at least, often empty) OSD is not itself at fault, and that the other hundreds of working OSDs are in fact not gone at all, before causing this kind of confusion.

--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx