Dear All,
Does Ceph have any mechanism to automatically pause the cluster and
stop recovery if a whole node, or more than a set number of OSDs, fails?
The reason for asking is that last night, one of the 20 OSD nodes on
our backup cluster crashed.
Ceph (of course) started recovering the "lost data", so by the time we
rebooted the failed node at 9am, ~3% of the data on the cluster was misplaced.
It's going to take several days for the cluster to re-balance, during
which we are going to have little I/O capacity for running backups, even
if I reduce the recovery priority.
We can look at turning the watchdog on, giving Nagios an action, etc.,
but I'd rather use any tools that Ceph has built in.
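
For reference, the external "Nagios action" route could look roughly like the untested sketch below. It assumes that `ceph osd dump --format json` returns an `osds` list whose entries carry an `up` field (1 or 0), and that setting the `noout` and `norebalance` flags is enough to stop data being remapped. The threshold of 10 down OSDs and the function names are made up for illustration.

```python
import json
import subprocess


def count_down_osds(osd_dump):
    """Count OSDs reported as down (up == 0) in a parsed `ceph osd dump`."""
    return sum(1 for osd in osd_dump.get("osds", []) if osd.get("up") == 0)


def flags_to_set(osd_dump, max_down, flags=("noout", "norebalance")):
    """Return the `ceph osd set <flag>` commands to run, or [] if none.

    max_down is a hypothetical threshold: flags are only set when MORE
    than max_down OSDs are down at the same time.
    """
    if count_down_osds(osd_dump) > max_down:
        return [["ceph", "osd", "set", flag] for flag in flags]
    return []


def apply_flags(max_down=10):
    """Query the cluster and set the flags if the threshold is exceeded."""
    raw = subprocess.run(
        ["ceph", "osd", "dump", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    for cmd in flags_to_set(json.loads(raw), max_down):
        subprocess.run(cmd, check=True)
```

An event handler or cron job would call `apply_flags()`. The flags then have to be cleared by hand (`ceph osd unset noout`, `ceph osd unset norebalance`) once the node is back.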
BTW, this is an Octopus cluster (15.2.15) with 580 OSDs, using EC 8+2.

best regards,
Jake
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.