Re: Ceph meltdown, need help

Dear Marc,

thank you for your endurance. I had another, slightly different "meltdown", this time throwing the MGRs out, and I adjusted yet another beacon grace time. Fortunately, thanks to your earlier messages, I didn't need to look for very long.

To harden our cluster a bit further, I would like to adjust a number of advanced parameters I found after your hints. I would be most grateful if you (or anyone else receiving this) still has enough endurance left to check whether what I want to do makes sense and whether the choices I suggest will achieve what I want.

Below, each parameter is listed with the relevant section of the documentation, the default in "{}", the current value plain, and a proposed new value prefixed with "*". There is also an error in the documentation; please let me know whether my interpretation of it is correct.


MON-MGR beacon adjustments
--------------------------
https://docs.ceph.com/docs/mimic/mgr/administrator/

mon mgr beacon grace {30}              300

This helped mitigate the second type of meltdown. I took two times the longest observed "mon slow op" time to be safe (MGR beacon handling was the slow op). Our MGRs are no longer thrown out during these incidents (see the very end for more info).
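
For completeness, a sketch of how such a change can be applied at runtime (assuming the Mimic centralized config store is in use; the same setting can of course also go into ceph.conf under [mon]):

    # raise the MGR beacon grace on the MONs from the default 30 s
    ceph config set mon mon_mgr_beacon_grace 300
    # verify that the override is recorded
    ceph config dump | grep mgr_beacon_grace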


MON-OSD communication adjustments
---------------------------------
https://docs.ceph.com/docs/mimic/rados/configuration/mon-osd-interaction/

osd beacon report interval {300}        300
mon osd report timeout {900}            3600
mon osd min down reporters {2}         *3
mon osd reporter subtree level {host}  *datacenter
mon osd down out subtree limit {rack}  *host

"mon osd report timeout" is increased after your recommendation. It is set to a really high value as I don't see this critical for fail-over (the default time-out suggests that this is merely for clean-up and not essential for healthy I/O). OSDs are no longer thrown out in case of the incident (see very end for more info).

"down reporter options": We have 3 sites (sub-clusters) under region in our crush map (see below). Each of these regions can be considered "equally laggy" as described in the documentation. I do not want a laggy site to mark down OSDs from another (healthy) site without a single OSD of the other site confirming an issue. I would like to require that at least 1 OSD from each site needs to report an OSD down before something happens. Does "3" and "datacenter" achieve what I want? Is this a reasonable choice with our crush map?

Note that, as a special case, DC2 currently links to some hosts of DC3 (this will change in the future).

"mon osd down out subtree limit": A host in our cluster is currently the atomic unit which, if it goes down, should not trigger rebalancing on the cluster as this indicates a server and not a disk fail. In addition, if I understand it correctly, this will also act as an automatic "noout" on host level if, for example, a host gets rebooted.


mon osd laggy *

I saw tuning parameters for laggy OSDs. However, our incidents happen very sporadically and are extremely drastic. I do not think that any reasonable estimator will be able to handle that, so my working hypothesis is that I should not touch these.


Error in documentation
----------------------

https://docs.ceph.com/docs/mimic/rados/configuration/mon-osd-interaction/#osds-report-their-status

osd_mon_report_interval_max {Error ENOENT:}
osd beacon report interval

The documentation mentions "osd mon report interval max", which does not exist. However, "osd beacon report interval" exists but is not mentioned. I assume the latter replaced the former?
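
One can probe a running daemon to see which of the two options actually exists (a sketch, assuming an OSD is running locally as osd.0; the exact error text may differ by release):

    # the documented option: expected to fail with an unknown-option error
    ceph daemon osd.0 config get osd_mon_report_interval_max
    # the option that does exist: returns the current value (default 300)
    ceph daemon osd.0 config get osd_beacon_report_interval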


Condensed crush tree
--------------------

region R1
    datacenter DC1
        room DC1-R1
            host ceph-08            host ceph-09            host ceph-10            host ceph-11
            host ceph-12            host ceph-13            host ceph-14            host ceph-15
            host ceph-16            host ceph-17
    datacenter DC2
        host ceph-04        host ceph-05        host ceph-06        host ceph-07
        host ceph-18        host ceph-19        host ceph-20
    datacenter DC3
        room DC3-R1
            host ceph-04            host ceph-05            host ceph-06            host ceph-07
            host ceph-18            host ceph-19            host ceph-20            host ceph-21
            host ceph-22

Additional info about our meltdowns:

With "mon mgr beacon grace" and "mon osd report timeout" set to really high values, I finally managed to isolate a signal in our recordings that is connected with these strange incidents. It looks like a package storm is hitting exactly two MON+MGR nodes, leading to beacon time-outs with default settings. I will not continue this here, but rather prepare another thread "Cluster outage due to client IO" after checking network hardware. It looks as if two MON+MGR nodes are desperately trying to talk to each other but fail.

And this after only 1.5 years of relationship :)

Thanks for making it all the way down a second time!

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Marc Roos <M.Roos@xxxxxxxxxxxxxxxxx>
Sent: 06 May 2020 19:19
To: ag; brad.swanson; dan; Frank Schilder
Cc: ceph-users
Subject: RE:  Re: Ceph meltdown, need help

Made it all the way down ;) Thank you very much for the detailed info.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



