Re: Full cluster outage when ECONNREFUSED is triggered

Denis Krienbühl <denis@xxxxxxx> · Fri, 24 Nov 2023 14:38:48 +0100

Hi Frank.

> On 24 Nov 2023, at 14:27, Frank Schilder <frans@xxxxxx> wrote:
> 
> I have to ask a clarifying question. If I understand the intent of osd_fast_fail_on_connection_refused correctly, an OSD that receives a connection_refused should get marked down fast to avoid unnecessarily long wait times. And *only* OSDs that receive connection refused.
> 
> In your case, did booting up the server actually create a network route for all other OSDs to the wrong network as well? In other words, did it act as a gateway and all OSDs received connection refused messages and not just the ones on the critical host? If so, your observation would be expected. If not, then there is something wrong with the down reporting that should be looked at.

No, the server has two networks through which it reaches OSDs and mons, say north and south. South was down, and the traffic destined for it went through the default gateway to an unrelated host that bounced everything with “connection refused”.

North was still up, and through it the other OSDs and mons could also be reached.

So only the host that was booted had the wrong configuration.

Traffic from the other hosts in the cluster was unaffected, and their network configuration remained as is, though their packets would no longer have reached the OSDs on the booted host via south. As I understand it, those packets would have been dropped.
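
For anyone who wants to check the difference between "refused" and "dropped" from a given host, here is a minimal Python sketch. It is only an illustration, not how the Ceph messenger detects failures: the function name probe and its parameters are made up for this example, and it assumes plain TCP connects to whatever address and port you pass in.

```python
import errno
import socket


def probe(host, port, timeout=2.0):
    """Classify one TCP connect attempt.

    Hypothetical helper for illustration only. Returns "open",
    "refused" (peer answered with a reset, i.e. ECONNREFUSED),
    "timeout" (no answer at all, e.g. packets dropped), or
    "error:<NAME>" for any other OS-level failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something actively rejected the connection.
        return "refused"
    except socket.timeout:
        # Nothing came back within the timeout.
        return "timeout"
    except OSError as exc:
        # Other failures, such as no route to host.
        return "error:" + str(errno.errorcode.get(exc.errno, exc.errno))
```

Called as probe(addr, port) against an OSD address, a silently dropping path would be expected to show up as "timeout", while a host that bounces the traffic would show up as "refused".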

I’ll create a detailed ticket and post it to this thread. I’m not sure I’ll be able to do it today, but after what I’ve heard I think this should at least be looked at in detail, and I’ll provide as much info as I can.

Denis
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx