Pause cluster if node crashes?

Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> · Fri, 18 Feb 2022 10:22:19 +0000

Dear All,

Does Ceph have any mechanism to automatically pause the cluster and 
stop recovery if a node, or more than a set number of OSDs, fails?

The reason for asking is that last night one of the 20 OSD nodes on 
our backup cluster crashed.

Ceph (of course) started recovering "lost data", so when we rebooted the 
failed node at 9am, ~3% of the data on the cluster was misplaced.

It's going to take several days for the cluster to re-balance, during 
which we are going to have little I/O capacity for running backups, even 
if I reduce the recovery priority.

We can look at turning the watchdog on, giving Nagios an action, etc., 
but I'd rather use any tools that Ceph has built in.
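
For illustration, the external "action" fallback mentioned above could look 
roughly like the sketch below. It is a minimal sketch, not a built-in Ceph 
feature: it assumes the script runs on a host with a working ceph CLI and 
admin keyring, that "ceph osd tree --format json" reports each OSD with a 
"status" of "up" or "down", and that it runs more often than the 
mon_osd_down_out_interval setting (600 seconds by default, as far as I know). 
The threshold MAX_DOWN_OSDS and all function names are hypothetical.

```python
import json
import subprocess

# Hypothetical threshold: number of down OSDs tolerated before freezing data movement.
MAX_DOWN_OSDS = 10

# Existing Ceph OSD flags: noout stops down OSDs from being marked out,
# norebalance stops misplaced data from being moved around.
PAUSE_FLAGS = ("noout", "norebalance")


def down_osds(osd_tree):
    """Return the names of OSDs reported as down in parsed `ceph osd tree` JSON."""
    return [
        node["name"]
        for node in osd_tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("status") == "down"
    ]


def commands_to_run(osd_tree, max_down=MAX_DOWN_OSDS):
    """Return the ceph commands to issue, or an empty list if under the threshold."""
    if len(down_osds(osd_tree)) <= max_down:
        return []
    return [["ceph", "osd", "set", flag] for flag in PAUSE_FLAGS]


def main():
    """Query the cluster once and set the pause flags if too many OSDs are down."""
    raw = subprocess.run(
        ["ceph", "osd", "tree", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for cmd in commands_to_run(json.loads(raw)):
        subprocess.run(cmd, check=True)
```

Calling main() from cron or from a monitoring event handler would set the 
flags once the threshold is crossed. They then have to be cleared by hand 
with "ceph osd unset noout" and "ceph osd unset norebalance" after the node 
is back.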

BTW, this is an Octopus cluster (15.2.15) with 580 OSDs, using EC 8+2.

best regards,

Jake

--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx