Hi,

Yes, this is the option you're looking for:

https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/#confval-mon_osd_down_out_subtree_limit

The default is "rack" -- you want to set it to "host", so that when an
entire host goes down the mons won't automatically mark its OSDs out and
start rebalancing. I've put an example command below the quoted message.

Cheers, Dan

On Fri., Feb. 18, 2022, 11:23 Jake Grimmett, <jog@xxxxxxxxxxxxxxxxx> wrote:

> Dear All,
>
> Does ceph have any mechanism to automatically pause the cluster, and
> stop recovery, if one node or more than a set number of OSDs fail?
>
> The reason for asking is that last night one of the 20 OSD nodes on
> our backup cluster crashed.
>
> Ceph (of course) started recovering "lost data", so when we rebooted the
> failed node at 9am, ~3% of the data on the cluster was misplaced.
>
> It's going to take several days for the cluster to re-balance, during
> which we are going to have little I/O capacity for running backups, even
> if I reduce the recovery priority.
>
> We can look at turning the watchdog on, giving nagios an action, etc.,
> but I'd rather use any tools that ceph has built in.
>
> BTW, this is an Octopus cluster 15.2.15, 580 x OSDs, using EC 8+2.
>
> best regards,
>
> Jake
>
> --
> Dr Jake Grimmett
> Head Of Scientific Computing
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
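
For reference, setting it at runtime should look roughly like the commands
below -- a sketch only, I haven't re-tested this on 15.2.15, so please
verify the option name and syntax against the docs for your release:

  # set the option centrally for the mons via the config database
  ceph config set mon mon_osd_down_out_subtree_limit host

  # confirm the new value
  ceph config get mon mon_osd_down_out_subtree_limit

With that in place, losing an entire host should no longer trigger an
automatic mark-out and backfill (single-OSD failures are still handled as
usual), and you can decide yourself whether to mark the OSDs out or just
bring the node back.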