On Fri, Apr 12, 2019 at 5:42 AM Venky Shankar <vshankar@xxxxxxxxxx> wrote:
>
> On Fri, Apr 12, 2019 at 5:03 PM Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> >
> > (dropped ceph-fs since I just got a "needs approval" bounce from it last time)
> >
> > On Fri, Apr 12, 2019 at 4:27 AM Venky Shankar <vshankar@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Apr 11, 2019 at 6:58 PM Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> > > >
> > > > (CCing ceph-devel since I think it's odd to segregate topics to an
> > > > unlisted mailing list)
> > > >
> > > > On Thu, Apr 11, 2019 at 9:12 AM Venky Shankar <vshankar@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hey Jason,
> > > > >
> > > > > We are working towards bringing `top`-like functionality to CephFS
> > > > > for displaying various client (and MDS) metrics. Since RBD has
> > > > > something similar in the form of `perf image io*` via the rbd CLI, we
> > > > > would like to understand some finer details regarding its
> > > > > implementation and detail how CephFS is moving forward with `fs top`
> > > > > functionality.
> > > > >
> > > > > IIUC, the `rbd_support` manager module requests object perf counters
> > > > > from the OSD, thereby extracting image names from the returned list of
> > > > Technically it extracts the image ids since that's the only thing
> > > > encoded in the object name. The "rbd_support" manager module will
> > > > lazily translate the image ids back to a real image name as needed.
> > > > > hot objects. I'm guessing it's done this way since there is no RBD
> > > > > related active daemon to forward metrics data to the manager? OTOH,
> > > > It's because we are tracking client IO and we don't have a daemon in
> > > > the data path -- the OSDs are the only daemon in the IO path for RBD.
> > > ACK.
> > >
> > > > > `rbd-mirror` does make use of
> > > > > `MgrClient::service_daemon_update_status()` to forward mirror daemon
> > > > > status, which seems to be ok for anything that's not too bulky.
> > > > It's storing metrics that only it knows about. A good parallel
> > > > analogy would be for the MDS to export metrics for things that only it
> > > > would know about (e.g. the number of clients or caps, metadata
> > > > read/write rates). The "rbd-mirror" daemon stores JSON-encoded
> > > > metadata via the "service_daemon_update_status" API, but it also
> > > > passes PerfCounter metrics automatically to the MGR (see the usage of
> > > > the "rbd_mirror_perf_stats_prio" config option).
> > > > > For forwarding CephFS-related metrics to the Ceph Manager, sticking
> > > > > blobs of metrics data in the daemon status doesn't look clean
> > > > > (although it might work). Therefore, for CephFS, the `MMgrReport`
> > > > > message type is expanded to include metrics data as part of its
> > > > > report update process, as per:
> > > > >
> > > > > https://github.com/ceph/ceph/pull/26004/commits/a75570c0e73ef67bbca8f73a9742e10bb9deb505#diff-b7b92973d97c21398c2be357f6a38b3e
> > > > Just my 2 cents, but I think it's awkward to put an MDS-unique data
> > > > structure in a generic message. I would think most (if not all) of
> > > Agreed, that's a bit awkward -- but MMgrReport already has
> > > OSD-specific data in there.
> > Figured that the OSDs represent the vast majority of daemons in a Ceph
> > cluster, so they are probably first-tier citizens. We wouldn't want to
> > go down a road with MDS+RGW+NFS Ganesha+iSCSI tcmu-runner+RBD
> > mirror+... one-offs.
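
To make the image-id point from earlier in the thread concrete: RBD data
objects carry the image id in their names, so a hot-object list can be
reduced to per-image counters without any daemon in the data path. Below
is a minimal illustrative sketch in Python -- not the actual rbd_support
code, and the helper names are hypothetical; it only assumes the usual
"rbd_data.<image id>.<object number>" data-object naming:

    import re
    from collections import Counter

    _RBD_DATA_RE = re.compile(r'^rbd_data\.([0-9a-z]+)\.[0-9a-f]+$')

    def image_id_from_object_name(name):
        # Return the image id embedded in an RBD data object name, or None.
        m = _RBD_DATA_RE.match(name)
        return m.group(1) if m else None

    def hot_image_ids(hot_objects):
        # hot_objects: iterable of (object_name, op_count) pairs, e.g. the
        # per-object counters returned by an OSD perf query.
        counts = Counter()
        for name, ops in hot_objects:
            image_id = image_id_from_object_name(name)
            if image_id is not None:
                counts[image_id] += ops
        # Highest-traffic image ids first; id -> name translation can
        # happen lazily, as described above.
        return counts.most_common()

For example, hot_image_ids([("rbd_data.10226b8b4567.0000000000000004", 42)])
would yield [("10226b8b4567", 42)].
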
>
> True -- don't want to pollute generic message types with
> daemon-specific data. As you mentioned, the OSD is probably an exception.
>
> Or, generalize it to support MDS (and other daemons when needed).

Another option is free-form JSON that is delivered (?) to a particular
mgr module.

> > > > your MDS metrics could be passed generically via the PerfCounter
> > > > export mechanism.
> > > Probably, but that would be just aggregated values, right? We would
> > > need per-client metrics.
> > What metrics are you attempting to collect from the client to report
> > back to the MGR?
> Pretty basic as of now:
> - client capability hits
> - OSDC cache hits, readahead util
>
> along with a snapshot of all sessions w/ per-session stats.
> > Does the MDS already have these client metrics? Can
> > the MDS not just provide its own "MDS command" I/F to query those
> > metrics a la what "rbd_support" is providing in the MGR?
> That's where I was coming to -- the MDS (rank 0) would have all the
> metrics that would be shown as part of "top". The MGR can poll the MDS
> for client metadata and only poll the session list if it sees a
> client-id in an OSD stat that it doesn't know about.

Polling doesn't feel like the right approach here. The MDS should just
periodically forward all of these statistics. I also don't see why we
need the OSDs involved.

> I'm thinking forwarding data to the manager would bring benefits in
> the form of caching, etc. done by the MGR.

It also allows the mgr to present the data in the form of graphs on the
dashboard. As suggested elsewhere, I don't think having some script
talk to the MDS to present a CephFS iotop is the way to go. For better
or worse, the mgr is where we handle cluster-wide performance metadata.

--
Patrick Donnelly
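
For illustration, a minimal sketch of the push-based flow argued for
above: rank 0 periodically forwards per-client metrics (cap hits, OSDC
cache hits, readahead utilization) and the mgr keeps only the latest
snapshot and ranks clients for an `fs top`-style view. All names here
are hypothetical -- this is not the actual MMgrReport payload or mgr
module interface, just a sketch of the bookkeeping being discussed:

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class ClientMetrics:
        # Per-client counters mentioned in the thread (hypothetical names).
        cap_hits: int = 0
        cap_misses: int = 0
        osdc_cache_hits: int = 0
        osdc_cache_misses: int = 0
        readahead_util: float = 0.0   # fraction of readahead actually used

    @dataclass
    class FsTopState:
        # client id -> latest snapshot, replaced by each periodic MDS report
        clients: Dict[int, ClientMetrics] = field(default_factory=dict)

        def handle_mds_report(self, report: Dict[int, ClientMetrics]) -> None:
            # Rank 0 pushes these periodically; the mgr never polls.
            self.clients.update(report)

        def top(self, n: int = 10) -> List[Tuple[int, float]]:
            # Rank clients by capability hit rate, highest first.
            def hit_rate(m: ClientMetrics) -> float:
                total = m.cap_hits + m.cap_misses
                return m.cap_hits / total if total else 0.0
            ranked = [(cid, hit_rate(m)) for cid, m in self.clients.items()]
            return sorted(ranked, key=lambda kv: kv[1], reverse=True)[:n]

Keeping the aggregation in the mgr is what lets the dashboard reuse the
same data for graphs, rather than having a separate script query the MDS.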