I recently configured Prometheus to scrape the mgr's /metrics endpoint and added Grafana dashboards. All daemons are currently at 15.2.11. I use HashiCorp Consul to advertise the active mgr in DNS, and Prometheus points at a single DNS target. (Is anyone else using this method, or just statically pointing Prometheus at all potentially active managers?)

All was working fine initially, and it's *mostly* still working fine. After the first couple of days, though, a few rate metrics stopped meaningfully increasing: they are essentially pegged at zero, which is implausible in a healthy cluster. Some cluster maintenance was occurring at the time, such as marking some OSDs out and recreating them, so I have a baseline for throughput and recovery.

Metric graphs that stopped functioning:

Throughput: ceph_osd_op_r_out_bytes, ceph_osd_op_w_in_bytes, ceph_osd_op_rw_in_bytes
Recovery: ceph_osd_recovery_ops

I can see that the Grafana dashboards convert the counters to rates with queries of this form:

    sum(irate(ceph_osd_recovery_ops{job="$job"}[$interval]))

The underlying counters appear to be sane, and reading the raw values directly from Prometheus also returns valid data, so I'm guessing some failure of either the irate or the sum function? By inspection in Grafana, the queries return correct timestamps with zero values, so that leaves us with "sum(irate)" as the likely source of the problem.

Does anyone have experience with this? I admit it is possibly tangential to Ceph itself, but as the Prometheus/Grafana integration is more or less supported, I thought I'd try here first among active mgr/Prometheus users.

--
Jeremy Austin
jhaustin@xxxxxxxxx
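P.S. For context, the scrape configuration is roughly along these lines; treat it as a sketch, since the Consul service name below is illustrative (9283 is the default port for the mgr prometheus module):

    scrape_configs:
      - job_name: 'ceph'
        # Single DNS name; Consul serves the A record of the active mgr
        dns_sd_configs:
          - names:
              - 'ceph-mgr.service.consul'   # illustrative Consul service name
            type: 'A'
            port: 9283                      # mgr prometheus module default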
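For anyone poking at this outside Grafana, queries along these lines should isolate whether irate() or sum() is the culprit (job="ceph" and the 5m window are just example substitutions for the $job/$interval variables):

    # per-OSD instantaneous rate, no aggregation
    irate(ceph_osd_recovery_ops{job="ceph"}[5m])

    # same aggregation, but rate() averages over the whole window
    # instead of using only the last two samples
    sum(rate(ceph_osd_recovery_ops{job="ceph"}[5m]))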