Re: Pause cluster if node crashes?

Hi Dan,

Thanks for that - it's exactly the setting we needed :)

Have a good weekend,

Jake

On 2/18/22 10:37, Dan van der Ster wrote:
Hi,

Yes, this is the option you're looking for:

https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/#confval-mon_osd_down_out_subtree_limit

The default is rack -- you want to set that to "host".

Cheers, Dan
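As I understand the monitor's behaviour, a down OSD is not automatically marked "out" when the entire CRUSH subtree containing it, at the level named by mon_osd_down_out_subtree_limit or above, is down. The Python below is a minimal, simplified model of that check. The toy topology and all helper names (would_mark_out, node01, and so on) are hypothetical. The model ignores the mon_osd_down_out_interval timer and flags such as noout. The numeric CRUSH type ids are assumed; only their ordering matters here. On Octopus, the setting itself can be applied with `ceph config set mon mon_osd_down_out_subtree_limit host`.

```python
from collections import defaultdict

# Assumed ordering of default CRUSH types; only the relative order matters here.
TYPE_IDS = {"osd": 0, "host": 1, "chassis": 2, "rack": 3, "row": 4, "root": 11}

# Hypothetical toy topology: one rack holding two hosts with two OSDs each.
NODE_TYPE = {
    "default": "root", "rack1": "rack",
    "node01": "host", "node02": "host",
    "osd.0": "osd", "osd.1": "osd", "osd.2": "osd", "osd.3": "osd",
}
PARENT = {
    "rack1": "default",
    "node01": "rack1", "node02": "rack1",
    "osd.0": "node01", "osd.1": "node01",
    "osd.2": "node02", "osd.3": "node02",
}


def subtree_is_down(node, node_type, children, down_osds):
    """An OSD is down if listed; a bucket is down if all of its children are."""
    if node_type[node] == "osd":
        return node in down_osds
    return all(subtree_is_down(c, node_type, children, down_osds)
               for c in children.get(node, []))


def containing_subtree_is_down(osd, limit, node_type, parent, down_osds):
    """Walk up from `osd`: True if a fully-down ancestor of type >= `limit` exists."""
    limit_id = TYPE_IDS[limit]
    if limit_id <= 0:
        return False  # this sketch only models bucket types as the limit
    children = defaultdict(list)
    for child, par in parent.items():
        children[par].append(child)
    current = osd
    while True:
        if not subtree_is_down(current, node_type, children, down_osds):
            return False  # something at this level is still up
        if TYPE_IDS[node_type[current]] >= limit_id:
            return True   # a whole subtree at least as big as `limit` is down
        if current not in parent:
            return False  # reached the root without hitting the limit type
        current = parent[current]


def would_mark_out(osd, limit, down_osds):
    """True if a down OSD would eventually be auto-marked out under this limit."""
    return osd in down_osds and not containing_subtree_is_down(
        osd, limit, NODE_TYPE, PARENT, down_osds)


if __name__ == "__main__":
    whole_host_down = {"osd.0", "osd.1"}  # node01 crashed
    single_disk_down = {"osd.2"}          # one disk on node02 failed
    print(would_mark_out("osd.0", "rack", whole_host_down))   # True (default limit)
    print(would_mark_out("osd.0", "host", whole_host_down))   # False
    print(would_mark_out("osd.2", "host", single_disk_down))  # True
```

With "host", the OSDs of a fully crashed node stay "in", so no rebalancing starts. Single-disk failures on otherwise healthy hosts are still marked out and recovered. The trade-off is that PGs on the failed host stay degraded until it returns, so the cluster tolerates less additional failure in the meantime.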



On Fri., Feb. 18, 2022, 11:23 Jake Grimmett, <jog@xxxxxxxxxxxxxxxxx> wrote:

    Dear All,

    Does Ceph have any mechanism to automatically pause the cluster and
    stop recovery if one node, or more than a set number of OSDs, fails?

    The reason I ask is that last night one of the 20 OSD nodes on our
    backup cluster crashed.

    Ceph (of course) started recovering the "lost data", so by the time we
    rebooted the failed node at 9am, ~3% of the data on the cluster was
    misplaced.

    It's going to take several days for the cluster to rebalance, during
    which we will have little I/O capacity for running backups, even if
    I reduce the recovery priority.

    We could look at enabling the watchdog, giving Nagios an action, etc.,
    but I'd rather use any tools that Ceph has built in.

    BTW, this is an Octopus 15.2.15 cluster with 580 OSDs, using EC 8+2.

    best regards,

    Jake

    --
    Dr Jake Grimmett
    Head Of Scientific Computing
    MRC Laboratory of Molecular Biology
    Francis Crick Avenue,
    Cambridge CB2 0QH, UK.

    _______________________________________________
    ceph-users mailing list -- ceph-users@xxxxxxx
    To unsubscribe send an email to ceph-users-leave@xxxxxxx



For help, read https://www.mrc-lmb.cam.ac.uk/scicomp/
then contact unixadmin@xxxxxxxxxxxxxxxxx
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539


