I recently configured Prometheus to scrape the mgr's /metrics endpoint and added Grafana dashboards. All daemons are currently at 15.2.11. I use HashiCorp Consul to advertise the active mgr in DNS, and Prometheus points at a single DNS target. (Is anyone else using this method, or just statically pointing Prometheus at all potentially active managers?)

All was working fine initially, and it's *mostly* still working fine. After the first couple of days, though, a few rate metrics stopped meaningfully increasing: they are essentially pegged at zero, which is implausible in a healthy cluster. Some cluster maintenance was occurring at the time, such as marking some OSDs out and recreating them, so I have a baseline for throughput and recovery.

Metric graphs that stopped functioning:

Throughput: ceph_osd_op_r_out_bytes, ceph_osd_op_w_in_bytes, ceph_osd_op_rw_in_bytes
Recovery: ceph_osd_recovery_ops

I can see that the Grafana dashboards convert the counters to rates with queries of this form:

    sum(irate(ceph_osd_recovery_ops{job="$job"}[$interval]))

The underlying counters appear to be sane, and reading the raw values directly from Prometheus also returns valid data, so I'm guessing some failure of either the irate or the sum function? By inspection in Grafana, the queries return correct timestamps with zero values, so that leaves us with "sum(irate)" as the likely source of the problem.

Does anyone have experience with this? I admit it is possibly tangential to Ceph itself, but as the Prometheus/Grafana integration is more or less supported, I thought I'd try here first among active mgr/Prometheus users.

--
Jeremy Austin
jhaustin@xxxxxxxxx
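P.S. For context, the scrape configuration is roughly along these lines; treat it as a sketch, since the Consul service name below is illustrative (9283 is the default port for the mgr prometheus module):

    scrape_configs:
      - job_name: 'ceph'
        # Single DNS name; Consul serves the A record of the active mgr
        dns_sd_configs:
          - names:
              - 'ceph-mgr.service.consul'   # illustrative Consul service name
            type: 'A'
            port: 9283                      # mgr prometheus module default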
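For anyone poking at this outside Grafana, queries along these lines should isolate whether irate() or sum() is the culprit (job="ceph" and the 5m window are just example substitutions for the $job/$interval variables):

    # per-OSD instantaneous rate, no aggregation
    irate(ceph_osd_recovery_ops{job="ceph"}[5m])

    # same aggregation, but rate() averages over the whole window
    # instead of using only the last two samples
    sum(rate(ceph_osd_recovery_ops{job="ceph"}[5m]))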