Pause cluster if node crashes?

Dear All,

Does Ceph have any mechanism to automatically pause the cluster and stop recovery if a whole node, or more than a set number of OSDs, fails?

The reason for asking is that last night one of the 20 OSD nodes on our backup cluster crashed.

Ceph (of course) started recovering the "lost data", so by the time we rebooted the failed node at 9am, ~3% of the data on the cluster was misplaced.

It's going to take several days for the cluster to rebalance, during which time we will have little I/O capacity for running backups, even if I reduce the recovery priority.

We could look at turning the watchdog on, giving Nagios an action, etc., but I'd rather use whatever tools Ceph has built in.
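To make the external option concrete, here is a rough sketch of the sort of action a monitoring check could trigger. It is only an illustration under stated assumptions, not a tested procedure: the threshold value and the function names are hypothetical, the choice of flags (noout, so that down OSDs are not marked out, plus norebalance) is one possible policy, and the JSON layout assumed for "ceph osd dump --format json" (an "osds" list whose entries carry an integer "up" field) should be checked against the actual output on the cluster.

```python
import json
import subprocess

# Hypothetical threshold: this many OSDs down at once is treated as a node failure.
DOWN_OSD_THRESHOLD = 20

# Cluster-wide OSD flags to set; this particular selection is an assumption.
PAUSE_FLAGS = ("noout", "norebalance")


def count_down_osds(osd_dump):
    """Count OSDs reported as down (up == 0) in a parsed OSD dump."""
    return sum(1 for osd in osd_dump.get("osds", []) if osd.get("up") == 0)


def flags_to_set(down_count, threshold=DOWN_OSD_THRESHOLD):
    """Return the flags to set once the number of down OSDs reaches the threshold."""
    return PAUSE_FLAGS if down_count >= threshold else ()


def apply_flags():
    """Query the cluster and set the flags if needed (needs the ceph CLI and a keyring)."""
    raw = subprocess.run(
        ["ceph", "osd", "dump", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    flags = flags_to_set(count_down_osds(json.loads(raw)))
    for flag in flags:
        subprocess.run(["ceph", "osd", "set", flag], check=True)
    return flags
```

A Nagios event handler or cron job would call apply_flags(); the flags then have to be cleared by hand with "ceph osd unset <flag>" once the node is back.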

BTW, this is an Octopus cluster (15.2.15) with 580 OSDs, using EC 8+2.

best regards,

Jake

--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
