On 10/20/2014 08:22 PM, Craig Lewis wrote:
I've just started on this myself. I started with https://ceph.com/docs/v0.80/dev/perf_counters/

I'm currently monitoring latency, using (to pick one example) [op_w_latency][sum] and [op_w_latency][avgcount]. Both values are counters, so they only increase with time. The lifetime average latency of the cluster isn't very useful, so I track the deltas of those values, then divide the recent deltas to get the average latency over my sample period. Just graphing the latencies let me see a spike in write latency on all disks on one node, which eventually led me to a dead write-cache battery.

That's for the OSDs. I have similar things set up for the MONs and RadosGW. I'm sure there are many more useful things to graph. One of the things I'm interested in (but haven't found time to research yet) is journal usage, with maybe some alerts if the journal is more than 90% full.
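To make the delta arithmetic concrete, here is a minimal sketch of that approach, assuming the OSD's admin socket is reachable via "ceph daemon" on the local host; the osd.0 name, the 30-second interval, and the exact JSON path ("osd" -> "op_w_latency") are assumptions that may differ by Ceph version.

#!/usr/bin/env python
# Sketch: average write latency over a sample period from perf counter deltas.
# Assumes "ceph daemon osd.0 perf dump" works on this host (adjust the OSD id).
import json
import subprocess
import time

OSD = "osd.0"          # placeholder OSD id
INTERVAL = 30          # sample period in seconds

def read_op_w_latency():
    """Return (sum, avgcount) for op_w_latency from the OSD's perf dump."""
    out = subprocess.check_output(["ceph", "daemon", OSD, "perf", "dump"])
    counters = json.loads(out)
    lat = counters["osd"]["op_w_latency"]   # key path may vary by release
    return float(lat["sum"]), lat["avgcount"]

prev_sum, prev_count = read_op_w_latency()
while True:
    time.sleep(INTERVAL)
    cur_sum, cur_count = read_op_w_latency()
    ops = cur_count - prev_count
    if ops > 0:
        # Average seconds per write op during this interval, not since boot.
        print("avg op_w_latency: %.4f s over %d ops" % ((cur_sum - prev_sum) / ops, ops))
    prev_sum, prev_count = cur_sum, cur_count

Feeding the per-interval value into a graphing tool (rather than the lifetime average) is what makes node-level spikes like the dead write-cache battery stand out.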
This is not likely to be an issue with the default journal config since the wbthrottle code is pretty aggressive about flushing the journal to avoid spiky client IO. Having said that, I tend to agree that we need to do a better job of documenting everything from the perf counters to the states described in dump_historic_ops. Even internally it can get confusing trying to keep track of what's going on where.
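Since dump_historic_ops comes up here, a quick illustration of pulling it from the admin socket may help; the command itself is standard, but the osd.0 name and the exact JSON field names below are assumptions that vary between releases, so treat this as a sketch rather than a reference.

#!/usr/bin/env python
# Sketch: list the slowest recent ops an OSD remembers, via dump_historic_ops.
# The top-level key has been "Ops" or "ops" depending on the Ceph release.
import json
import subprocess

out = subprocess.check_output(["ceph", "daemon", "osd.0", "dump_historic_ops"])
data = json.loads(out)
ops = data.get("Ops") or data.get("ops") or []

# Show the five longest-running ops with their descriptions; each op also
# carries a per-state event timeline that is useful once the states are documented.
for op in sorted(ops, key=lambda o: o.get("duration", 0), reverse=True)[:5]:
    print("%8.3fs  %s" % (op.get("duration", 0), op.get("description", "?")))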
Mark
On Mon, Oct 13, 2014 at 2:57 PM, Jakes John <jakesjohn12345@xxxxxxxxx> wrote:

Bump :). It would be helpful if someone could share info related to debugging using counters/stats.

On Sun, Oct 12, 2014 at 7:42 PM, Jakes John <jakesjohn12345@xxxxxxxxx> wrote:

Hi All,

I would like to know if there are useful performance counters in Ceph which can help to debug the cluster. I have seen hundreds of stat counters in various daemon dumps. Some of them are:

1. commit_latency_ms
2. apply_latency_ms
3. snap_trim_queue_len
4. num_snap_trimming

What do these indicate? I have used iostat and atop for cluster statistics, but none of them show internal Ceph status. Machines might be new, but OSDs can still be slow. If some of these counters can help to debug why certain OSDs are bad (or can get bad later), it would be great -- for example, counters like total processed requests, pending requests in queue, average time taken to process a request, etc.

Are there any docs for all the performance counters which I can read? I couldn't find anything in the Ceph docs.

Thanks
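On the "docs for all performance counters" question, the admin socket's "perf schema" command (described in the perf_counters dev doc linked above) at least gives a machine-readable list of every counter a daemon exposes. A minimal sketch, assuming osd.0 and the schema layout from that doc:

#!/usr/bin/env python
# Sketch: enumerate every perf counter an OSD exposes, using "perf schema".
# Assumes osd.0's admin socket is reachable; sections and names vary by daemon.
import json
import subprocess

out = subprocess.check_output(["ceph", "daemon", "osd.0", "perf", "schema"])
schema = json.loads(out)

for section, counters in sorted(schema.items()):
    for name, meta in sorted(counters.items()):
        # "type" is the bitmask described in the perf_counters dev doc
        # (e.g. whether the counter is a plain value or a sum/avgcount pair).
        print("%s.%s (type %s)" % (section, name, meta.get("type")))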
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com