Hi,

Yes, this is the option you're looking for:

https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/#confval-mon_osd_down_out_subtree_limit

The default is "rack" -- you want to set it to "host", so that when an
entire host goes down the mons won't automatically mark its OSDs out and
start rebalancing. I've put an example command below the quoted message.

Cheers, Dan

On Fri., Feb. 18, 2022, 11:23 Jake Grimmett, <jog@xxxxxxxxxxxxxxxxx> wrote:

> Dear All,
>
> Does ceph have any mechanism to automatically pause the cluster, and
> stop recovery, if one node or more than a set number of OSDs fail?
>
> The reason for asking is that last night one of the 20 OSD nodes on
> our backup cluster crashed.
>
> Ceph (of course) started recovering "lost data", so when we rebooted the
> failed node at 9am, ~3% of the data on the cluster was misplaced.
>
> It's going to take several days for the cluster to re-balance, during
> which we are going to have little I/O capacity for running backups, even
> if I reduce the recovery priority.
>
> We can look at turning the watchdog on, giving nagios an action, etc.,
> but I'd rather use any tools that ceph has built in.
>
> BTW, this is an Octopus cluster 15.2.15, 580 x OSDs, using EC 8+2.
>
> best regards,
>
> Jake
>
> --
> Dr Jake Grimmett
> Head Of Scientific Computing
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
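
For reference, setting it at runtime should look roughly like the commands
below -- a sketch only, I haven't re-tested this on 15.2.15, so please
verify the option name and syntax against the docs for your release:

  # set the option centrally for the mons via the config database
  ceph config set mon mon_osd_down_out_subtree_limit host

  # confirm the new value
  ceph config get mon mon_osd_down_out_subtree_limit

With that in place, losing an entire host should no longer trigger an
automatic mark-out and backfill (single-OSD failures are still handled as
usual), and you can decide yourself whether to mark the OSDs out or just
bring the node back.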