With older releases, Michael Kidd's log parser scripts were invaluable, notably map_reporters_to_buckets.sh:

https://github.com/linuxkidd/ceph-log-parsers

With newer releases, at least, one can send `dump_blocked_ops` to the OSD admin socket. I collect these via Prometheus / node_exporter, and it's straightforward to visualize them in Grafana with per-OSD and per-node queries; the built-in metrics might offer this data too. Often there's a pattern: a given node/rack/OSD is the outlier for blocked ops, with a cohort of others affected via replication.

Other things to look for, also reported by node_exporter, are network packet drops or retransmits, CRC/framing errors on the switch side, a drop in MemAvailable, high load average, etc. The admin socket additionally exposes OSD lifetimes, mon op latency, and large OSD tcmalloc heap freelists. I'm a big fan of Prometheus and Grafana, and it's really straightforward to add one's own stats too.

Drive write latency can be tracked with something like

clamp_min(delta(node_disk_write_time_ms{ceph_role="osd",device=~"sd.*"}[5m]) / delta(node_disk_writes_completed{ceph_role="osd",device=~"sd.*"}[5m]), 0)

This can help identify outlier drives and firmware issues. It also pays to track drive e2e / UDMA / CRC errors, reallocated blocks (absolute and rate), and lifetime remaining via SMART, though SMART is not as uniformly implemented as one would like, so some interpretation and abstraction is warranted.

— ymmv
aad

> I am curious, though, how one might have pin-pointed a troublesome
> host/OSD prior to this. Looking back at some of the detail when
> attempting to diagnose, we do see some ops taking longer in
> sub_op_committed, but not really a lot else. We'd get an occasional
> slow operation on OSD warning, but the OSDs were spread across various
> ceph nodes, not just the one with issues, I'm assuming due to EC.
>
> There was no real clarity on where the 'jam' was happening, at least
> in anything we looked at. I'm wondering if there's a better way to see
> what, specifically, is "slow" on a cluster. Looking at even the OSD
> perf output wasn't helpful, because all of that was fine - it was
> likely due to EC and write operations to OSDs on that specific node in
> question. Is there some way to look at a cluster and see which hosts
> are problematic/leading to slowness in an EC-based setup?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
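
For the admin-socket approach above, a minimal sketch of a per-OSD blocked-ops check to run on each OSD node. The default socket path and the jq fallback are assumptions; the exact JSON layout of dump_blocked_ops varies a bit across releases:

    #!/bin/sh
    # Count currently blocked ops for every OSD daemon on this host.
    # Assumes the default /var/run/ceph socket location and that jq is installed.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        n=$(ceph --admin-daemon "$sock" dump_blocked_ops 2>/dev/null \
            | jq '.num_blocked_ops // (.ops | length)')
        echo "$(hostname -s) ${sock} blocked_ops=${n:-unknown}"
    done

Output like this can be fed to node_exporter's textfile collector to get per-OSD blocked-op counts into Prometheus without writing a full exporter.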