Yes, the unit is seconds for those latencies. sum/avgcount is the average since the daemon was (re)started.

If you're interested, I've co-authored a collectd plugin which captures data from Ceph daemons. Built into the plugin is an option to calculate either the long-run average (sum/avgcount) or the last-poll delta ((sum_now - sum_last_poll) / (avgcount_now - avgcount_last_poll)).

It's been added to the latest collectd branch (https://github.com/collectd/collectd).

Dan Ryder

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Francois Lafont
Sent: Wednesday, April 08, 2015 10:11 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re: What are you doing to locate performance issues in a Ceph cluster?

Chris Kitzmiller wrote:

>> ~# ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok perf
>>
>> [...]
>>
>> "osd": { "opq": 0,
>>       "op_wip": 0,
>>       "op": 3566,
>>       "op_in_bytes": 208803635,
>>       "op_out_bytes": 146962506,
>>       "op_latency": { "avgcount": 3566,
>>           "sum": 100.330695000},
>>       "op_process_latency": { "avgcount": 3566,
>>           "sum": 84.702772000},
>>       "op_r": 471,
>>       "op_r_out_bytes": 146851024,
>>       "op_r_latency": { "avgcount": 471,
>>           "sum": 1.329795000},
>>
>> [...]
>>
>> Is the value of "op_r_latency" (ie 1.329ms above)?
>> In this case, I don't understand the meaning of "avgcount"
>> and "sum".
>>
>> "sum" is the sum of what?
>> "avgcount" is the average of what?
>
> There are a bunch of these avgcount/sum pairs and, from what I've gleaned, you're to simply divide sum by avgcount to get the mean of that particular stat over whatever time frame it is measuring.

Err..., I'm sorry, I'm not sure I understand well. If I take the values of op_r_latency above, I have:

    sum/avgcount = 1.329795000/471 = 0.002823344

Would 0.002823344 ms be the latency of my read operations? That seems impossible to me (unfortunately ;)), or maybe the unit is seconds? In that case 2.823344 ms could be a plausible value.
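A minimal sketch of the two averages discussed in this thread, assuming the counters have already been parsed from the admin-socket JSON into Python dicts. The function names and the second poll's values are made up for illustration; this is not the collectd plugin's code.

```python
def long_run_avg(counter):
    """Mean latency in seconds since the daemon started: sum / avgcount."""
    if counter["avgcount"] == 0:
        return None  # no operations recorded yet
    return counter["sum"] / counter["avgcount"]


def last_poll_avg(prev, curr):
    """Mean latency in seconds over the operations seen between two polls."""
    delta_count = curr["avgcount"] - prev["avgcount"]
    if delta_count <= 0:
        # No new operations, or the counters went backwards
        # (for example because the daemon was restarted).
        return None
    return (curr["sum"] - prev["sum"]) / delta_count


# op_r_latency as shown in the perf output quoted above.
first_poll = {"avgcount": 471, "sum": 1.329795}
# Hypothetical later poll: 100 more reads taking 0.5 s in total.
second_poll = {"avgcount": 571, "sum": 1.829795}

# Counters are in seconds, so multiply by 1000 to get milliseconds.
print(f"long-run: {long_run_avg(first_poll) * 1000:.3f} ms")
print(f"last poll: {last_poll_avg(first_poll, second_poll) * 1000:.3f} ms")
```

With these inputs the long-run figure is about 2.823 ms, matching the calculation above, while the last-poll figure reflects only the 100 hypothetical reads between the two samples (about 5 ms).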
In any case, I don't understand the name "avgcount". The name "count" seems more logical to me (but maybe I haven't really understood its meaning).

Looking at the source code in ./src/common/perf_counters.cc, it seems to me that the number is indeed in seconds, but I'm (really) not a C++ expert. Could someone confirm that?

Another thing: if I understand correctly, the value sum/avgcount is an average latency computed since the start of the OSD daemon. So, after a long time, the average will become more stable and will no longer show much variation. Is it possible to reset the counters? I noticed that restarting the daemon works, but it's a little drastic.

--
François Lafont
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com