With older releases, Michael Kidd's log parser scripts were invaluable, notably map_reporters_to_buckets.sh:

https://github.com/linuxkidd/ceph-log-parsers

With newer releases, at least, one can send `dump_blocked_ops` to the OSD admin socket. I collect these via Prometheus / node_exporter, and it's straightforward to visualize them in Grafana with per-OSD and per-node queries; the built-in metrics might offer this data too. Often there's a pattern: a given node/rack/OSD is the outlier for blocked ops, with a cohort of others affected via replication.

Other things to look for, also reported by node_exporter, are network packet drops or retransmits, CRC/framing errors on the switch side, a drop in MemAvailable, high load average, etc. The admin socket additionally exposes OSD lifetimes, mon op latency, and large OSD tcmalloc heap freelists. I'm a big fan of Prometheus and Grafana, and it's really straightforward to add one's own stats too.

Drive write latency can be tracked with something like

clamp_min(delta(node_disk_write_time_ms{ceph_role="osd",device=~"sd.*"}[5m]) / delta(node_disk_writes_completed{ceph_role="osd",device=~"sd.*"}[5m]), 0)

This can help identify outlier drives and firmware issues. It also pays to track drive e2e / UDMA / CRC errors, reallocated blocks (absolute and rate), and lifetime remaining via SMART, though SMART is not as uniformly implemented as one would like, so some interpretation and abstraction is warranted.

— ymmv
aad

> I am curious, though, how one might have pin-pointed a troublesome
> host/OSD prior to this. Looking back at some of the detail when
> attempting to diagnose, we do see some ops taking longer in
> sub_op_committed, but not really a lot else. We'd get an occasional
> slow operation on OSD warning, but the OSDs were spread across various
> ceph nodes, not just the one with issues, I'm assuming due to EC.
>
> There was no real clarity on where the 'jam' was happening, at least
> in anything we looked at. I'm wondering if there's a better way to see
> what, specifically, is "slow" on a cluster. Looking at even the OSD
> perf output wasn't helpful, because all of that was fine - it was
> likely due to EC and write operations to OSDs on that specific node in
> question. Is there some way to look at a cluster and see which hosts
> are problematic/leading to slowness in an EC-based setup?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
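
For the admin-socket approach above, a minimal sketch of a per-OSD blocked-ops check to run on each OSD node. The default socket path and the jq fallback are assumptions; the exact JSON layout of dump_blocked_ops varies a bit across releases:

    #!/bin/sh
    # Count currently blocked ops for every OSD daemon on this host.
    # Assumes the default /var/run/ceph socket location and that jq is installed.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        n=$(ceph --admin-daemon "$sock" dump_blocked_ops 2>/dev/null \
            | jq '.num_blocked_ops // (.ops | length)')
        echo "$(hostname -s) ${sock} blocked_ops=${n:-unknown}"
    done

Output like this can be fed to node_exporter's textfile collector to get per-OSD blocked-op counts into Prometheus without writing a full exporter.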