Re: What are you doing to locate performance issues in a Ceph cluster?

"Dan Ryder (daryder)" <daryder@xxxxxxxxx> · Wed, 8 Apr 2015 14:24:33 +0000

Yes, those latencies are in seconds. sum/avgcount is the average since the daemon was (re)started.

If you're interested, I've co-authored a collectd plugin that captures data from Ceph daemons. The plugin gives the option to calculate either the long-run average (sum/avgcount) or the last-poll delta ((sum_now - sum_last_poll) / (avgcount_now - avgcount_last_poll)). It's been added to the latest collectd branch (https://github.com/collectd/collectd).
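To make the two calculations concrete, here is a minimal Python sketch. It is not the plugin's code; the function names and the "now"/"prev" sample values are hypothetical, and it only assumes that each poll yields a (sum, avgcount) pair for a counter.

```python
def long_run_avg(sum_now, count_now):
    """Average since the daemon (re)started: sum / avgcount."""
    if count_now == 0:
        return None  # no samples recorded yet
    return sum_now / count_now


def last_poll_avg(sum_now, count_now, sum_prev, count_prev):
    """Average over the samples recorded between two polls."""
    d_count = count_now - count_prev
    if d_count <= 0:
        # No new samples, or the counters went backwards
        # (for example after a daemon restart): nothing to report.
        return None
    return (sum_now - sum_prev) / d_count


# Previous poll: the op_r_latency values quoted in this thread.
prev = (1.329795, 471)
# Current poll: hypothetical values, 50 more reads taking 0.2 s in total.
now = (1.529795, 521)

overall = long_run_avg(*now)          # dominated by the long history
recent = last_poll_avg(*now, *prev)   # roughly 0.004 s for the last interval
```

The delta form reacts to recent changes, while the long-run form moves less and less as avgcount grows.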

Dan Ryder

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Francois Lafont
Sent: Wednesday, April 08, 2015 10:11 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  What are you doing to locate performance issues in a Ceph cluster?

Chris Kitzmiller wrote:

>> ~# ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok perf
>>
>>  [...]
>>
>>  "osd": { "opq": 0,
>>      "op_wip": 0,
>>      "op": 3566,
>>      "op_in_bytes": 208803635,
>>      "op_out_bytes": 146962506,
>>      "op_latency": { "avgcount": 3566,
>>          "sum": 100.330695000},
>>      "op_process_latency": { "avgcount": 3566,
>>          "sum": 84.702772000},
>>      "op_r": 471,
>>      "op_r_out_bytes": 146851024,
>>      "op_r_latency": { "avgcount": 471,
>>          "sum": 1.329795000},
>>
>>   [...]
>>
>> Is the value of "op_r_latency" the latency of read operations (i.e. 1.329ms above)?
>> In this case, I don't understand the meaning of "avgcount"
>> and "sum".
>>
>> "sum" is the sum of what?
>> "avgcount" is the average of what?
> 
> There are a bunch of these avgcount/sum pairs and, from what I've gleaned, you're to simply divide sum by avgcount to get the mean of that particular stat over whatever time frame it is measuring.

Err..., I'm sorry, I'm not sure I understand correctly. If I take the values of op_r_latency above, I have:

    sum/avgcount = 1.329795000/471 = 0.002823344

Would 0.002823344 ms be the latency of my read operations?
That seems impossible to me (unfortunately ;)), or maybe the unit is seconds?
In that case 2.823344 ms would be a plausible value. In any case, I don't understand the name "avgcount". The name "count" seems more logical to me (but maybe I haven't really understood its meaning).
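As a worked version of the arithmetic above, here is a small Python sketch that divides sum by avgcount for the latency counters quoted earlier and converts the result to milliseconds. It assumes the sums are in seconds (as the reply at the top of this thread confirms); the helper name mean_latency_ms is made up for illustration.

```python
# Counter values copied from the quoted perf output in this thread.
perf = {
    "op_latency": {"avgcount": 3566, "sum": 100.330695},
    "op_process_latency": {"avgcount": 3566, "sum": 84.702772},
    "op_r_latency": {"avgcount": 471, "sum": 1.329795},
}


def mean_latency_ms(counter):
    """Return sum/avgcount, converted from seconds to milliseconds."""
    if counter["avgcount"] == 0:
        return None  # nothing measured yet
    return counter["sum"] / counter["avgcount"] * 1000.0


for name, counter in perf.items():
    print(f"{name}: {mean_latency_ms(counter):.3f} ms")
```

With these values op_r_latency comes out at about 2.823 ms, which matches the "seconds" interpretation.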

Looking at the source code in ./src/common/perf_counters.cc, it seems to me that the number is indeed in seconds, but I'm (really) not a C++ expert.
Could someone confirm that?

Another thing: if I understand correctly, sum/avgcount is an average latency computed since the start of the OSD daemon. So after a long time, the average becomes more stable and no longer shows much variation.
Is it possible to reset the counters? I noticed that restarting the daemon works, but it's a little drastic.

--
François Lafont
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com