Full cluster outage when ECONNREFUSED is triggered

Hi

We recently had a serious outage at work after a host developed a network problem:

- We rebooted a single host in a cluster of fifteen hosts across three racks.
- That host came up with a bad network configuration after booting, causing it to send some packets to the wrong network.
- One network still worked and offered a connection to the mons.
- The other network connection was bad. Packets were refused, not dropped.
- Due to osd_fast_fail_on_connection_refused=true, the broken host caused the mons to immediately mark all other OSDs down.
- Only after shutting down the faulty host was it possible to restart the downed OSDs and restore the cluster.

We have since solved the problem by removing the default route that caused the packets to end up in the wrong network, where they were summarily rejected by a firewall. That is, we made sure that packets would be dropped in the future, not rejected.
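
For anyone who wants to see the difference first-hand, here is a minimal Python sketch (illustrative only, not Ceph code; the address below is a placeholder) of how the two failure modes surface to a connecting client. A REJECT comes back immediately as ECONNREFUSED, while a DROP only shows up as a timeout:

#!/usr/bin/env python3
"""Illustrative sketch only (not Ceph code): how REJECTed vs DROPped
packets look to a connecting client."""
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # REJECT path: the peer (or a firewall) answers with a TCP RST or
        # ICMP port-unreachable; the caller sees ECONNREFUSED instantly.
        # This is the signal osd_fast_fail_on_connection_refused acts on.
        return "refused (ECONNREFUSED, immediate)"
    except socket.timeout:
        # DROP path: packets are silently discarded; the caller learns
        # nothing until its own timeout (Ceph's heartbeat grace) expires.
        return f"no answer (timed out after {timeout}s)"
    finally:
        sock.close()


if __name__ == "__main__":
    # 192.0.2.1 (TEST-NET-1) is a placeholder; use any unused address.
    print(probe("192.0.2.1", 6789))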

Still, I figured I’d share this experience on the list, as others might run into the same thing.

In the PR that introduced osd_fast_fail_on_connection_refused (linked below), there’s this description:

> This changeset adds additional handler (handle_refused()) to the dispatchers
> and code that detects when connection attempt fails with ECONNREFUSED error
> (connection refused) which is a clear indication that host is alive, but
> daemon isn't, so daemons can instantly mark the other side as undoubtly
> downed without the need for grace timer.

And this comment:

> As for flapping, we discussed it on ceph-devel ml
> and came to conclusion that it requires either broken firewall or network
> configuration to cause this, and these are more serious issues that should
> be resolved first before worrying about OSDs flapping (either way, flapping
> OSDs could be good for getting someone's attention).

https://github.com/ceph/ceph/pull/8558
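
If I read the changeset right, there are essentially two failure paths with very different latencies. Here is a rough Python sketch of that logic with made-up names (the real code is C++ in the dispatchers; PeerState, FAST_FAIL_ON_REFUSED and GRACE_SECONDS are my stand-ins):

"""Rough sketch of the two failure paths the PR describes; all names here
are stand-ins, not Ceph's actual implementation."""
import time


class PeerState:
    def __init__(self) -> None:
        self.last_heartbeat = time.monotonic()
        self.marked_down = False


FAST_FAIL_ON_REFUSED = True  # mirrors osd_fast_fail_on_connection_refused
GRACE_SECONDS = 20.0         # stand-in for the heartbeat grace period


def handle_refused(peer: PeerState) -> None:
    # ECONNREFUSED: host reachable, daemon not listening. The PR treats
    # this as proof the daemon is dead and skips the grace timer.
    if FAST_FAIL_ON_REFUSED:
        peer.marked_down = True


def handle_heartbeat_check(peer: PeerState) -> None:
    # Dropped packets: no error ever comes back, so the peer is only
    # marked down after the grace period expires without a heartbeat.
    if time.monotonic() - peer.last_heartbeat > GRACE_SECONDS:
        peer.marked_down = True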

It has left us wondering whether these are the right assumptions. An ECONNREFUSED condition on a single host can bring down a whole cluster, and I wonder if there should be some kind of safeguard to prevent that. One badly configured host should generally not be able to do this. Notably, when the packets are dropped instead of refused, the mons notice that the OSD down reports come from only one host and act accordingly; the fast-fail path skips that check.
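
To make that contrast concrete, here is an illustrative sketch of a reporter quorum in the spirit of mon_osd_min_down_reporters (a simple count threshold; not the actual mon code, which also weighs reporter subtrees):

"""Illustrative sketch, not mon code: a down report only takes effect once
enough distinct reporters agree -- the safeguard the fast-fail path skips."""
from collections import defaultdict

MIN_DOWN_REPORTERS = 2  # in the spirit of mon_osd_min_down_reporters

down_reports: defaultdict[str, set[str]] = defaultdict(set)


def report_down(target_osd: str, reporter: str) -> bool:
    """Record a failure report; True once the target may be marked down."""
    down_reports[target_osd].add(reporter)
    # A single misconfigured host can never cross the threshold alone,
    # which is what contained the damage in the dropped-packet case.
    return len(down_reports[target_osd]) >= MIN_DOWN_REPORTERS


# One bad host reporting everyone down takes nothing down...
assert not report_down("osd.7", "badhost")
# ...a second, independent reporter is needed.
assert report_down("osd.7", "goodhost")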

What do you think? Does this warrant a change in Ceph? I’m happy to provide details and create a ticket.

Cheers,

Denis



