Excellent, that's a great start. We do use Prometheus/Grafana already, and we are collecting the data, so we'll make sure we add Alertmanager coverage. I was looking at dump_historic_ops, but it wasn't really showing _what_ the cause was; we'd see operations that spent a long time in a given state (such as sub_op_pending), but not what they were actually waiting on.

Thank you,
David

On Sat, Feb 27, 2021 at 3:22 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>
> With older releases, Michael Kidd's log parser scripts were invaluable, notably map_reporters_to_buckets.sh
>
> https://github.com/linuxkidd/ceph-log-parsers
>
> With newer releases, at least, one can send `dump_blocked_ops` to the OSD admin socket. I collect these via Prometheus / node_exporter; it's straightforward to visualize them in Grafana with queries per OSD and per node. The builtin metrics might offer this data too.
>
> Often there's a pattern: a given node/rack/OSD is the outlier for blocked ops, with a cohort of others affected via replication.
>
> Other things to look for are network packet drops or retransmits, CRC/framing errors on the switch side, a drop in MemAvailable, high load average, etc., also reported by node_exporter, as well as OSD lifetimes, mon op latency, and large OSD tcmalloc heap freelists (via the admin socket).
>
> I'm a big fan of Prometheus and Grafana. It's really straightforward to add one's own stats too. Drive write latency can be tracked with something like
>
> clamp_min(delta(node_disk_write_time_ms{ceph_role="osd",device=~"sd.*"}[5m])/delta(node_disk_writes_completed{ceph_role="osd",device=~"sd.*"}[5m]),0)
>
> This can help identify outlier drives and firmware issues.
>
> Tracking drive end-to-end / UDMA / CRC errors, reallocated blocks (absolute and rate), and lifetime remaining via SMART is also worthwhile, though SMART is not as uniformly implemented as one would like, so some interpretation and abstraction is warranted.
>
> -- ymmv aad
>
> > I am curious, though, how one might have pinpointed a troublesome
> > host/OSD prior to this. Looking back at some of the detail when
> > attempting to diagnose, we do see some ops taking longer in
> > sub_op_committed, but not really a lot else. We'd get an occasional
> > "slow operation on OSD" warning, but the OSDs were spread across
> > various Ceph nodes, not just the one with issues; I'm assuming that's
> > due to EC.
> >
> > There was no real clarity on where the 'jam' was happening, at least
> > in anything we looked at. I'm wondering if there's a better way to see
> > what, specifically, is "slow" on a cluster. Looking at even the OSD
> > perf output wasn't helpful, because all of that was fine - it was
> > likely due to EC and write operations to OSDs on that specific node in
> > question. Is there some way to look at a cluster and see which hosts
> > are problematic/leading to slowness in an EC-based setup?
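
Following up on the dump_blocked_ops suggestion above: one way those counts could land in Prometheus is node_exporter's textfile collector. This is only a rough, untested sketch - the metric name ceph_osd_blocked_ops, the output path, and the JSON field layout are my assumptions, not something from Anthony's setup:

#!/bin/sh
# Rough sketch: count blocked ops per OSD via the admin socket and write
# them out in node_exporter textfile-collector format (run from cron).
# The JSON layout may differ by release; adjust the jq filter to match.
OUT=/var/lib/node_exporter/textfile/ceph_blocked_ops.prom   # example path
TMP="$OUT.tmp"
: > "$TMP"
for sock in /var/run/ceph/ceph-osd.*.asok; do
    [ -S "$sock" ] || continue
    id=${sock##*/}; id=${id#ceph-osd.}; id=${id%.asok}
    # Assumes dump_blocked_ops returns an "ops" array, as dump_historic_ops does.
    n=$(ceph daemon "osd.$id" dump_blocked_ops 2>/dev/null | jq '.ops | length')
    echo "ceph_osd_blocked_ops{osd=\"$id\"} ${n:-0}" >> "$TMP"
done
mv "$TMP" "$OUT"   # rename so node_exporter never reads a partial file

A Grafana query like sum by (instance) (ceph_osd_blocked_ops) on top of that (hypothetical) metric would then surface the per-node outliers Anthony describes.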
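
And for the Alertmanager coverage mentioned at the top, a rough sketch of a Prometheus alerting rule built on the drive write latency expression Anthony shows above - the ceph_role/device labels are carried over from that query, and the 50 ms threshold is an arbitrary example to tune:

groups:
  - name: ceph-drive-latency
    rules:
      - alert: CephOsdDriveWriteLatencyHigh
        # Same expression as above: ms of write time per completed write,
        # averaged over 5m and clamped at 0.
        expr: >-
          clamp_min(delta(node_disk_write_time_ms{ceph_role="osd",device=~"sd.*"}[5m])
          / delta(node_disk_writes_completed{ceph_role="osd",device=~"sd.*"}[5m]), 0)
          > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High write latency on {{ $labels.device }} ({{ $labels.instance }})"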