Hi Surabhi,

I think both your approaches are fine. The recovery stats included in
ceph status are aggregated at the cluster level, so if that's what you
are looking for, you could just poll the output of ceph status. For
finer-grained statistics at the OSD/PG level, you could use the perf
counters or extract the relevant stats from the ceph pg dump output.
You could also use the ceph benchmarking tool's recovery test if you
like; details are in https://github.com/ceph/cbt/pull/218. I've put a
couple of rough sketches of both polling approaches below the quoted
thread.

Thanks,
Neha

On Mon, Jul 19, 2021 at 10:19 AM Josh Durgin <jdurgin@xxxxxxxxxx> wrote:
>
> Hello Surabhi, you've got the right ideas already - collecting the
> recovery/client stats from perf dump on each OSD is the most accurate
> instantaneous measure.
>
> The stats reported by ceph -s are aggregated from all OSDs, but have
> some delay, as each OSD sends updates to the mgr only every 5 seconds
> by default (mgr_stats_period controls this).
>
> Depending on what you're trying to measure, the aggregated values in
> ceph -s may be easier to use - that's what Sridhar used in testing
> mclock QoS, for example [0].
>
> Josh
>
> [0] https://docs.ceph.com/en/latest/dev/osd_internals/mclock_wpq_cmp_study/
>
> On 7/19/21 9:26 AM, Surabhi GUPTA wrote:
> > Hi,
> >
> > I am a graduate student at the University of Wisconsin-Madison. I
> > have been trying to understand the recovery mechanisms in Ceph and
> > had a question about collecting metrics accurately. My previous
> > email might have missed your notice, so I am sending a note again.
> >
> > Any help regarding this would be highly appreciated.
> >
> > Thank You,
> > Surabhi Gupta
> >
> > ------------------------------------------------------------------------
> > *From:* Surabhi GUPTA
> > *Sent:* Wednesday, July 14, 2021 5:47:47 PM
> > *To:* dev@xxxxxxx <dev@xxxxxxx>
> > *Subject:* Seeking advice regarding collecting better client and
> > recovery throughput metrics
> >
> > Hi,
> >
> > I was running some experiments to measure client I/O throughput and
> > recovery throughput in a Ceph cluster. I am a bit uncertain whether
> > I am collecting the metrics correctly. Could you please tell me if
> > this is the right way, or if there is anything better I can do to
> > collect more accurate statistics?
> >
> > To generate load on the cluster, I use the rados bench utility and
> > plot the avg MB/s and cur MB/s values it reports. For recovery, I
> > periodically query the perf dump for each OSD and look at
> > recovery_ops and recovery_bytes. I then calculate the recovery
> > throughput from the difference between the values obtained on
> > successive queries and the time elapsed between those queries.
> >
> > I also saw that the ceph status output displays client IOPS and
> > recovery IOPS, so another option is to periodically run "ceph -s",
> > extract these values, and use them for the analysis.
> >
> > Could you please tell me which is the better way to obtain these
> > metrics, in the sense of which one exposes more accurate
> > instantaneous throughput values? Is there any other method, apart
> > from these two, that I should be looking at?
> >
> > I would greatly appreciate any help regarding this!
> >
> > Thank You,
> > Surabhi Gupta
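
For the per-OSD route, here is a minimal sketch of the polling loop you
describe, turning the recovery_ops/recovery_bytes counters from perf dump
into a throughput figure. It assumes "ceph tell osd.<id> perf dump" works
from the node you run it on (recent releases); on older releases you would
run "ceph daemon osd.<id> perf dump" locally on each OSD's host instead.
The OSD ids and poll interval are placeholders.

#!/usr/bin/env python3
# Sketch only: per-OSD recovery throughput from perf dump counters.
# Assumes "ceph tell osd.<id> perf dump" works from this node; otherwise
# query the admin socket locally with "ceph daemon osd.<id> perf dump".
import json
import subprocess
import time

OSD_IDS = [0, 1, 2]   # placeholder: the OSDs you want to watch
INTERVAL = 5          # seconds between polls

def recovery_counters(osd_id):
    out = subprocess.check_output(
        ["ceph", "tell", "osd.%d" % osd_id, "perf", "dump"])
    osd = json.loads(out)["osd"]
    return osd["recovery_ops"], osd["recovery_bytes"]

prev = {i: recovery_counters(i) for i in OSD_IDS}
t_prev = time.time()
while True:
    time.sleep(INTERVAL)
    now = time.time()
    dt = now - t_prev
    t_prev = now
    for i in OSD_IDS:
        ops, nbytes = recovery_counters(i)
        d_ops, d_bytes = ops - prev[i][0], nbytes - prev[i][1]
        prev[i] = (ops, nbytes)
        print("osd.%d: %.1f recovery ops/s, %.1f MB/s"
              % (i, d_ops / dt, d_bytes / dt / 1e6))

Summing the per-OSD deltas gives a cluster-wide figure comparable to what
ceph -s reports, but without the mgr aggregation delay.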
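
And for the aggregated route, a similarly rough sketch that polls the JSON
form of ceph status instead of scraping the human-readable output. The
pgmap field names used here (read_bytes_sec, write_bytes_sec,
recovering_bytes_per_sec, and so on) are what I believe recent releases
emit, and they are omitted whenever there is no matching activity, so
please double-check them against your version.

#!/usr/bin/env python3
# Sketch only: cluster-wide client and recovery rates from "ceph status".
# The pgmap keys below are assumptions based on recent releases; they are
# missing from the JSON whenever there is no corresponding activity.
import json
import subprocess
import time

while True:
    out = subprocess.check_output(["ceph", "status", "--format", "json"])
    pgmap = json.loads(out)["pgmap"]
    client_mb = (pgmap.get("read_bytes_sec", 0) +
                 pgmap.get("write_bytes_sec", 0)) / 1e6
    client_ops = (pgmap.get("read_op_per_sec", 0) +
                  pgmap.get("write_op_per_sec", 0))
    recovery_mb = pgmap.get("recovering_bytes_per_sec", 0) / 1e6
    recovery_objs = pgmap.get("recovering_objects_per_sec", 0)
    print("client: %.1f MB/s, %d op/s | recovery: %.1f MB/s, %d objects/s"
          % (client_mb, client_ops, recovery_mb, recovery_objs))
    time.sleep(5)  # little point polling faster than mgr_stats_period

As Josh noted, these values lag by up to mgr_stats_period, so they are
better suited to longer-running measurements than to instantaneous ones.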