On Sat, Mar 27, 2021 at 11:15 AM Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>
> On 3/27/21 1:11 AM, Kefu Chai wrote:
> > hi folks,
> >
> > i want to draw your attention to the tracker ticket
> > https://tracker.ceph.com/issues/48909 and discuss a better
> > solution with you.
> >
> > some context first: back in https://github.com/ceph/ceph/pull/18614,
> > changes were made so that slow requests are reported to mgr, to move
> > the burden from monitor to mgr. with that change, all health related
> > reports are sent to mgr, and the aggregated version is composed by
> > mgr and sent to monitor. i think that helps to improve the
> > scalability of a Ceph cluster. moreover, IIUC, letting mgr take part
> > of the load of the monitor was one of the reasons why mgr was
> > introduced in the first place.
> >
> > in https://tracker.ceph.com/issues/43975, it's reported that slow ops
> > have no longer been recorded in the cluster log since mimic. as a fix,
> > https://github.com/ceph/ceph/pull/33328 was created to send slow ops
> > and their types to the cluster log.
> >
> > in https://tracker.ceph.com/issues/43975, it's noticed that this fix
> > even worsens the performance of a cluster suffering from slow ops by
> > adding more load to the monitor. hence
> > https://github.com/ceph/ceph/pull/39199 was created to throttle this.
> >
> > i am wondering if we can make better use of the health reporting
> > machinery instead of pouring the health warnings into clog when slow
> > ops are observed?
> >
> > what do you think?
>
> Thanks for bringing this up Kefu, I agree there's a lot of room for
> improvement here. It'd be a good topic for CDS.

Agreed, added it to https://pad.ceph.com/p/cds-quincy.
Neha

>
> There's no reason the cluster log needs to go through paxos or be stored
> in the monitor DB, and some sort of throttling or data reduction would
> help on the producer side. We've seen issues not just with slow ops
> but other warnings reporting too frequently overloading the monitors
> as well.
>
> https://github.com/ceph/ceph/pull/40168 is related on the consumer side,
> and also helps other cases of temporary mon overload (e.g. from a burst
> of osdmap creation from blocklisting).
>
> Josh
>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx