On 10/20/2014 08:22 PM, Craig Lewis wrote:
I've just started on this myself. I started with https://ceph.com/docs/v0.80/dev/perf_counters/

I'm currently monitoring latency, using (to pick one example) [op_w_latency][sum] and [op_w_latency][avgcount]. Both values are counters, so they only increase with time. The lifetime average latency of the cluster isn't very useful, so I track the deltas of those values, then divide the recent deltas to get the average latency over my sample period. Just graphing the latencies let me see a spike in write latency on all disks on one node, which eventually led me to a dead write-cache battery.

That's for the OSDs. I have similar things set up for the MONs and RadosGW. I'm sure there are many more useful things to graph. One of the things I'm interested in (but haven't found time to research yet) is journal usage, with maybe some alerts if the journal is more than 90% full.
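To make the delta arithmetic concrete, here is a minimal sketch of that approach, assuming the OSD's admin socket is reachable via "ceph daemon" on the local host; the osd.0 name, the 30-second interval, and the exact JSON path ("osd" -> "op_w_latency") are assumptions that may differ by Ceph version.

#!/usr/bin/env python
# Sketch: average write latency over a sample period from perf counter deltas.
# Assumes "ceph daemon osd.0 perf dump" works on this host (adjust the OSD id).
import json
import subprocess
import time

OSD = "osd.0"          # placeholder OSD id
INTERVAL = 30          # sample period in seconds

def read_op_w_latency():
    """Return (sum, avgcount) for op_w_latency from the OSD's perf dump."""
    out = subprocess.check_output(["ceph", "daemon", OSD, "perf", "dump"])
    counters = json.loads(out)
    lat = counters["osd"]["op_w_latency"]   # key path may vary by release
    return float(lat["sum"]), lat["avgcount"]

prev_sum, prev_count = read_op_w_latency()
while True:
    time.sleep(INTERVAL)
    cur_sum, cur_count = read_op_w_latency()
    ops = cur_count - prev_count
    if ops > 0:
        # Average seconds per write op during this interval, not since boot.
        print("avg op_w_latency: %.4f s over %d ops" % ((cur_sum - prev_sum) / ops, ops))
    prev_sum, prev_count = cur_sum, cur_count

Feeding the per-interval value into a graphing tool (rather than the lifetime average) is what makes node-level spikes like the dead write-cache battery stand out.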
This is not likely to be an issue with the default journal config since the wbthrottle code is pretty aggressive about flushing the journal to avoid spiky client IO. Having said that, I tend to agree that we need to do a better job of documenting everything from the perf counters to the states described in dump_historic_ops. Even internally it can get confusing trying to keep track of what's going on where.
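Since dump_historic_ops comes up here, a quick illustration of pulling it from the admin socket may help; the command itself is standard, but the osd.0 name and the exact JSON field names below are assumptions that vary between releases, so treat this as a sketch rather than a reference.

#!/usr/bin/env python
# Sketch: list the slowest recent ops an OSD remembers, via dump_historic_ops.
# The top-level key has been "Ops" or "ops" depending on the Ceph release.
import json
import subprocess

out = subprocess.check_output(["ceph", "daemon", "osd.0", "dump_historic_ops"])
data = json.loads(out)
ops = data.get("Ops") or data.get("ops") or []

# Show the five longest-running ops with their descriptions; each op also
# carries a per-state event timeline that is useful once the states are documented.
for op in sorted(ops, key=lambda o: o.get("duration", 0), reverse=True)[:5]:
    print("%8.3fs  %s" % (op.get("duration", 0), op.get("description", "?")))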
Mark
On Mon, Oct 13, 2014 at 2:57 PM, Jakes John <jakesjohn12345@xxxxxxxxx> wrote:

Bump :). It would be helpful if someone could share info related to debugging using counters/stats.

On Sun, Oct 12, 2014 at 7:42 PM, Jakes John <jakesjohn12345@xxxxxxxxx> wrote:

Hi All,

I would like to know if there are useful performance counters in Ceph which can help to debug the cluster. I have seen hundreds of stat counters in various daemon dumps. Some of them are:

1. commit_latency_ms
2. apply_latency_ms
3. snap_trim_queue_len
4. num_snap_trimming

What do these indicate? I have used iostat and atop for cluster statistics, but none of them show internal Ceph status. Machines might be new, but OSDs can still be slow. If some of these counters can help to debug why certain OSDs are bad (or can get bad later), it would be great -- for example, counters like total processed requests, pending requests in queue, average time taken to process a request, etc.

Are there any docs for all the performance counters which I can read? I couldn't find anything in the Ceph docs.

Thanks
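On the "docs for all performance counters" question, the admin socket's "perf schema" command (described in the perf_counters dev doc linked above) at least gives a machine-readable list of every counter a daemon exposes. A minimal sketch, assuming osd.0 and the schema layout from that doc:

#!/usr/bin/env python
# Sketch: enumerate every perf counter an OSD exposes, using "perf schema".
# Assumes osd.0's admin socket is reachable; sections and names vary by daemon.
import json
import subprocess

out = subprocess.check_output(["ceph", "daemon", "osd.0", "perf", "schema"])
schema = json.loads(out)

for section, counters in sorted(schema.items()):
    for name, meta in sorted(counters.items()):
        # "type" is the bitmask described in the perf_counters dev doc
        # (e.g. whether the counter is a plain value or a sum/avgcount pair).
        print("%s.%s (type %s)" % (section, name, meta.get("type")))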
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com