Re: log slow ops to cluster log

Neha Ojha <nojha@xxxxxxxxxx> · Mon, 29 Mar 2021 09:45:31 -0700

On Sat, Mar 27, 2021 at 11:15 AM Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>
> On 3/27/21 1:11 AM, Kefu Chai wrote:
> > hi folks,
> >
> > i want to draw your attention to the tracker ticket
> > https://tracker.ceph.com/issues/48909 and discuss a better solution
> > with you.
> >
> > some context first: back in https://github.com/ceph/ceph/pull/18614,
> > changes were made so that slow requests were reported to mgr, to move
> > the burden from monitor to mgr. with that change, all health related
> > reports are sent to mgr, and the aggregated version is composed by mgr
> > and sent to monitor. i think that helps to improve the scalability of a
> > Ceph cluster. moreover, IIUC, letting mgr take part of the load off the
> > monitor was one of the reasons why mgr was introduced in the first place.
> >
> > in https://tracker.ceph.com/issues/43975, it's reported that slow ops
> > have not been recorded in the cluster log since mimic. as a fix,
> > https://github.com/ceph/ceph/pull/33328 was created to send slow ops
> > and their types to the cluster log.
> >
> > in https://tracker.ceph.com/issues/43975, it's noticed that this fix
> > even worsens the performance of a cluster suffering from slow ops by
> > adding more load to the monitor. hence
> > https://github.com/ceph/ceph/pull/39199 was created to throttle this.
> >
> > i am wondering if we can make better use of the health reporting
> > machinery instead of pouring health warnings into the clog when slow
> > ops are observed?
> >
> > what do you think?
>
> Thanks for bringing this up Kefu, I agree there's a lot of room for
> improvement here. It'd be a good topic for CDS.

Agreed, added it to https://pad.ceph.com/p/cds-quincy.

Neha

>
> There's no reason the cluster log needs to go through paxos or be stored
> in the monitor DB, and some sort of throttling or data reduction would
> help on the producer side. We've seen issues not just with slow ops:
> other warnings reported too frequently have overloaded the monitors
> as well.
>
> https://github.com/ceph/ceph/pull/40168 is related on the consumer side,
> and also helps other cases of temporary mon overload (e.g. from a burst
> of osdmap creation from blocklisting).
>
> Josh
>
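
To make the "throttling or data reduction on the producer side" idea
concrete, here is a minimal sketch in Python. It is not Ceph code and
does not reflect what any of the PRs above actually do. The class name,
the emit callback and the 60 second interval are all assumptions made up
for illustration. The idea is to coalesce slow op reports and send at
most one aggregated line per interval instead of one line per report.

```python
import time
from collections import Counter


class ThrottledSlowOpLog:
    """Hypothetical producer-side throttle: coalesce slow-op reports and
    emit at most one aggregated summary per interval."""

    def __init__(self, emit, min_interval=60.0, clock=time.monotonic):
        self._emit = emit                  # callable taking one summary string
        self._min_interval = min_interval  # assumed value, seconds
        self._clock = clock                # injectable for testing
        self._last_emit = None             # time of the last emitted summary
        self._pending = Counter()          # op type -> count since last summary

    def report(self, op_type, count=1):
        """Record slow ops; flush a summary once the interval has elapsed."""
        self._pending[op_type] += count
        now = self._clock()
        if self._last_emit is None or now - self._last_emit >= self._min_interval:
            self.flush(now)

    def flush(self, now=None):
        """Emit one aggregated line for everything pending, if anything."""
        if not self._pending:
            return
        total = sum(self._pending.values())
        by_type = ", ".join(f"{t}={n}" for t, n in sorted(self._pending.items()))
        self._emit(f"{total} slow ops ({by_type})")
        self._pending.clear()
        self._last_emit = self._clock() if now is None else now
```

With something like this, a daemon that sees hundreds of slow ops within
one interval would send a single line carrying the counts per op type,
so the monitor load stays bounded by the interval rather than by the
number of slow ops. The trade-off is that per-op detail and exact
timestamps are lost from the cluster log.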