Re: What are you doing to locate performance issues in a Ceph cluster?

Chris Kitzmiller <ckitzmiller@xxxxxxxxxxxxx> · Wed, 8 Apr 2015 08:41:09 -0400

On Apr 7, 2015, at 7:44 PM, Francois Lafont wrote:
> Chris Kitzmiller wrote:
>> I graph aggregate stats for `ceph --admin-daemon 
>> /var/run/ceph/ceph-osd.$osdid.asok perf dump`. If the max latency strays too far 
>> outside of my mean latency I know to go look for the troublemaker. My graphs 
>> look something like this:
>> 
>> [...]
> 
> Thanks Chris for these interesting explanations.
> Sorry for my basic question, but which entry in the output gives
> you the read latency?
> 
> Here is an example from my cluster (Firefly):
> 
> ~# ceph --admin-daemon /var/run/ceph/ceph-osd.2.asok perf
> 
>  [...]
> 
>  "osd": { "opq": 0,
>      "op_wip": 0,
>      "op": 3566,
>      "op_in_bytes": 208803635,
>      "op_out_bytes": 146962506,
>      "op_latency": { "avgcount": 3566,
>          "sum": 100.330695000},
>      "op_process_latency": { "avgcount": 3566,
>          "sum": 84.702772000},
>      "op_r": 471,
>      "op_r_out_bytes": 146851024,
>      "op_r_latency": { "avgcount": 471,
>          "sum": 1.329795000},
> 
>   [...]
> 
> Is it the value of "op_r_latency" (i.e. 1.329 ms above)?
> If so, I don't understand the meaning of "avgcount"
> and "sum".
> 
> "sum" is the sum of what?
> "avgcount" is the average of what?

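There are a bunch of these avgcount/sum pairs and, from what I've gleaned, you simply divide sum by avgcount to get the mean of that particular stat over whatever time frame it is measuring. In other words, "avgcount" is the number of samples and "sum" is their running total, so neither is a latency on its own. For your op_r_latency that works out to 1.329795 / 471 ≈ 0.0028 per read op.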
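
For what it's worth, here is a minimal Python sketch of that division, using the op_r_latency numbers from your output. The helper names (mean_latency, interval_latency) are made up for illustration, and two things are assumptions on my part rather than anything I've confirmed: that "sum" is in seconds, and that the counters are cumulative, so that subtracting two successive dumps gives the mean for the interval between them.

```python
import json

def mean_latency(counter):
    """Mean of one avgcount/sum pair: sum / avgcount, or None with no samples."""
    count = counter["avgcount"]
    if count == 0:
        return None  # no ops recorded yet, so there is nothing to average
    return counter["sum"] / count

def interval_latency(prev, curr):
    """Mean over the window between two dumps (assumes cumulative counters)."""
    dcount = curr["avgcount"] - prev["avgcount"]
    if dcount <= 0:
        return None  # no new ops, or the counters went backwards (e.g. a restart)
    return (curr["sum"] - prev["sum"]) / dcount

# Fragment of the perf output quoted above, reduced to the one entry we need.
dump = json.loads('{"osd": {"op_r_latency": {"avgcount": 471, "sum": 1.329795000}}}')

mean = mean_latency(dump["osd"]["op_r_latency"])
# Multiplying by 1000 gives milliseconds only if "sum" really is in seconds.
print("mean read latency: %.2f ms" % (mean * 1000))
```

To graph something more responsive than an all-time mean, you would keep the previous dump around and feed both to interval_latency on each poll.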