Re: Seeking advice regarding collecting better client and recovery throughput metrics

Hello Surabhi, you've got the right ideas already - collecting the recovery/client stats from the perf dump on each OSD is the most accurate instantaneous measure.

The stats reported by ceph -s are aggregated from all OSDs, but have
some delay, as each OSD sends updates to the mgr only every 5 seconds by
default (mgr_stats_period controls this).
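
If you want to double check that interval on your own cluster, a minimal sketch along these lines should do it (assuming a release where "ceph config get" reports the compiled-in default when the option has not been overridden; older releases may only show explicitly set values):

    import subprocess

    # Ask the cluster for the mgr stats reporting period (defaults to 5 seconds).
    period = subprocess.check_output(
        ["ceph", "config", "get", "mgr", "mgr_stats_period"]).decode().strip()
    print(f"mgr stats reporting period: {period} s")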

Depending on what you're trying to measure, the aggregated values in ceph -s may be easier to use - that's what Sridhar used in testing mclock QoS for example [0].

Josh

[0] https://docs.ceph.com/en/latest/dev/osd_internals/mclock_wpq_cmp_study/

On 7/19/21 9:26 AM, Surabhi GUPTA wrote:
Hi,

I am a graduate student at the University of Wisconsin-Madison. I have been trying to understand the recovery mechanisms in Ceph and had a question about collecting metrics accurately. My previous email may have escaped your notice, so I am sending a note again.

Any help regarding this would be highly appreciated.

Thank You,
Surabhi Gupta

------------------------------------------------------------------------
*From:* Surabhi GUPTA
*Sent:* Wednesday, July 14, 2021 5:47:47 PM
*To:* dev@xxxxxxx <dev@xxxxxxx>
*Subject:* Seeking advice regarding collecting better client and recovery throughput metrics
Hi,

I was running some experiments to measure client I/O throughput and recovery throughput in a Ceph cluster. I am a bit uncertain whether I am collecting the metrics correctly. Could you please tell me if this is the right way, or if I can do anything better to collect more accurate statistics?

To generate load on the cluster, I use the rados bench utility and plot the avg MB/s and cur MB/s values reported by the tool. For recovery, I periodically query the perf dump for each OSD and look at the recovery_ops and recovery_bytes counters. I then calculate the recovery throughput from the difference between the values obtained on successive queries and the time elapsed between those queries.
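
For concreteness, a minimal sketch of that polling loop (not from the original mail): it assumes the script runs on the OSD host so it can reach the admin socket via "ceph daemon osd.<id> perf dump", that the counters sit under the "osd" section of the dump, and it uses made-up OSD ids - all of which should be checked against your own output.

    import json
    import subprocess
    import time

    OSD_IDS = [0, 1, 2]   # hypothetical OSD ids local to this host
    INTERVAL = 5          # seconds between samples

    def recovery_bytes(osd_id):
        """Return the cumulative recovery_bytes counter from one OSD's perf dump."""
        out = subprocess.check_output(
            ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"])
        return json.loads(out)["osd"]["recovery_bytes"]

    prev = {i: recovery_bytes(i) for i in OSD_IDS}
    prev_t = time.monotonic()

    while True:
        time.sleep(INTERVAL)
        cur = {i: recovery_bytes(i) for i in OSD_IDS}
        cur_t = time.monotonic()
        # Throughput = counter delta / wall-clock time between samples.
        mb_per_s = sum(cur[i] - prev[i] for i in OSD_IDS) / (cur_t - prev_t) / 1e6
        print(f"recovery throughput: {mb_per_s:.1f} MB/s")
        prev, prev_t = cur, cur_t

The same loop works for recovery_ops if you want operations per second instead of bytes.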

I also saw that "ceph -s" displays client IOPS and recovery IOPS. So another option is to periodically query "ceph -s", extract these values, and use them for the analysis.
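
For comparison, a sketch of that aggregated approach, parsing the JSON form of "ceph -s". The pgmap key names used here ("read_bytes_sec", "write_bytes_sec", "recovering_bytes_per_sec") are an assumption - they tend to be present only while there is client or recovery activity and can differ between releases, so inspect one dump by hand first.

    import json
    import subprocess
    import time

    while True:
        status = json.loads(
            subprocess.check_output(["ceph", "-s", "--format", "json"]))
        pgmap = status.get("pgmap", {})
        client_mb_s = (pgmap.get("read_bytes_sec", 0)
                       + pgmap.get("write_bytes_sec", 0)) / 1e6
        recovery_mb_s = pgmap.get("recovering_bytes_per_sec", 0) / 1e6
        print(f"client: {client_mb_s:.1f} MB/s   recovery: {recovery_mb_s:.1f} MB/s")
        time.sleep(5)   # roughly the default mgr reporting period

Keep in mind that these values are the mgr's view and lag actual OSD activity by up to the reporting period mentioned above.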

Could you please tell me which is the best way to obtain these metrics, in the sense of which one exposes more accurate instantaneous throughput values?
Is there any other method apart from these two that I should be looking at?

I would greatly appreciate any help regarding this!

Thank You,
Surabhi Gupta

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx

