Excellent, that's a great start. We do use Prometheus/Grafana already, and we are collecting the data, so we'll make sure we add Alertmanager coverage. I was looking at dump_historic_ops, but it wasn't really showing _what_ the cause was; we'd see operations that spent a long time in a given state (such as sub_op_pending), but not what they were actually waiting on.

Thank you,
David

On Sat, Feb 27, 2021 at 3:22 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>
> With older releases, Michael Kidd's log parser scripts were invaluable, notably map_reporters_to_buckets.sh
>
> https://github.com/linuxkidd/ceph-log-parsers
>
> With newer releases, at least, one can send `dump_blocked_ops` to the OSD admin socket. I collect these via Prometheus / node_exporter; it's straightforward to visualize them in Grafana with queries per OSD and per node. The builtin metrics might offer this data too.
>
> Often there's a pattern: a given node/rack/OSD is the outlier for blocked ops, with a cohort of others affected via replication.
>
> Other things to look for are network packet drops or retransmits, CRC/framing errors on the switch side, a drop in MemAvailable, high load average, etc., also reported by node_exporter, as well as OSD lifetimes, mon op latency, and large OSD tcmalloc heap freelists (via the admin socket).
>
> I'm a big fan of Prometheus and Grafana. It's really straightforward to add one's own stats too. Drive write latency can be tracked with something like
>
> clamp_min(delta(node_disk_write_time_ms{ceph_role="osd",device=~"sd.*"}[5m])/delta(node_disk_writes_completed{ceph_role="osd",device=~"sd.*"}[5m]),0)
>
> This can help identify outlier drives and firmware issues.
>
> Tracking drive end-to-end / UDMA / CRC errors, reallocated blocks (absolute and rate), and lifetime remaining via SMART is also worthwhile, though SMART is not as uniformly implemented as one would like, so some interpretation and abstraction is warranted.
>
> -- ymmv aad
>
> > I am curious, though, how one might have pinpointed a troublesome
> > host/OSD prior to this. Looking back at some of the detail when
> > attempting to diagnose, we do see some ops taking longer in
> > sub_op_committed, but not really a lot else. We'd get an occasional
> > "slow operation on OSD" warning, but the OSDs were spread across
> > various Ceph nodes, not just the one with issues; I'm assuming that's
> > due to EC.
> >
> > There was no real clarity on where the 'jam' was happening, at least
> > in anything we looked at. I'm wondering if there's a better way to see
> > what, specifically, is "slow" on a cluster. Looking at even the OSD
> > perf output wasn't helpful, because all of that was fine - it was
> > likely due to EC and write operations to OSDs on that specific node in
> > question. Is there some way to look at a cluster and see which hosts
> > are problematic/leading to slowness in an EC-based setup?
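
Following up on the dump_blocked_ops suggestion above: one way those counts could land in Prometheus is node_exporter's textfile collector. This is only a rough, untested sketch - the metric name ceph_osd_blocked_ops, the output path, and the JSON field layout are my assumptions, not something from Anthony's setup:

#!/bin/sh
# Rough sketch: count blocked ops per OSD via the admin socket and write
# them out in node_exporter textfile-collector format (run from cron).
# The JSON layout may differ by release; adjust the jq filter to match.
OUT=/var/lib/node_exporter/textfile/ceph_blocked_ops.prom   # example path
TMP="$OUT.tmp"
: > "$TMP"
for sock in /var/run/ceph/ceph-osd.*.asok; do
    [ -S "$sock" ] || continue
    id=${sock##*/}; id=${id#ceph-osd.}; id=${id%.asok}
    # Assumes dump_blocked_ops returns an "ops" array, as dump_historic_ops does.
    n=$(ceph daemon "osd.$id" dump_blocked_ops 2>/dev/null | jq '.ops | length')
    echo "ceph_osd_blocked_ops{osd=\"$id\"} ${n:-0}" >> "$TMP"
done
mv "$TMP" "$OUT"   # rename so node_exporter never reads a partial file

A Grafana query like sum by (instance) (ceph_osd_blocked_ops) on top of that (hypothetical) metric would then surface the per-node outliers Anthony describes.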
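
And for the Alertmanager coverage mentioned at the top, a rough sketch of a Prometheus alerting rule built on the drive write latency expression Anthony shows above - the ceph_role/device labels are carried over from that query, and the 50 ms threshold is an arbitrary example to tune:

groups:
  - name: ceph-drive-latency
    rules:
      - alert: CephOsdDriveWriteLatencyHigh
        # Same expression as above: ms of write time per completed write,
        # averaged over 5m and clamped at 0.
        expr: >-
          clamp_min(delta(node_disk_write_time_ms{ceph_role="osd",device=~"sd.*"}[5m])
          / delta(node_disk_writes_completed{ceph_role="osd",device=~"sd.*"}[5m]), 0)
          > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High write latency on {{ $labels.device }} ({{ $labels.instance }})"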