Hi Nikola,

> I'm trying to catch a ghost here... On one of our clusters (6 nodes,
> 14.2.15, EC pool 4+2, 6*32 SATA bluestore OSDs) we got into a very
> strange state.
>
> The cluster is clean (except for a "pgs not deep-scrubbed in time"
> warning, since we've disabled scrubbing while investigating) and there
> is absolutely no activity on the EC pool, but according to atop, all
> OSDs are still reading furiously, without any apparent reason.

Was there PG movement (backfill) happening in this cluster recently? Do
the OSDs have stray PGs (e.g. 'ceph daemon osd.NN perf dump | grep
numpg_stray' - run this against an affected OSD from the housing node)?

I'm wondering if you're running into
https://tracker.ceph.com/issues/45765, where cleaning of PGs from OSDs
leads to a high read rate from disk due to a combination of rocksdb
behaviour and caching issues.

Turning on bluefs_buffered_io (on by default in 14.2.22) is a mitigation
for this problem, but it has some side effects to watch out for (write
IOPS amplification, for one). Fixes for that linked issue went into
14.2.17, 14.2.22, and then Pacific; we found the 14.2.17 changes to be
quite effective by themselves.

Even if you don't have stray PGs, trying bluefs_buffered_io might be an
interesting experiment.

An alternative would be to compact rocksdb for each of your OSDs and see
if that helps; compacting eliminates the tombstoned data that can cause
problems during iteration, but if you have a workload that generates a
lot of rocksdb tombstones (like PG cleaning does), then the problem will
return a while after compaction.

Josh
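
P.S. In case it helps, here's roughly how I'd check numpg_stray across
every OSD on a node in one go. This is only a sketch: it assumes the
default cluster name and admin socket location, so adjust to your
environment.

  # on an OSD host: report the stray PG count for each local OSD
  for sock in /var/run/ceph/ceph-osd.*.asok; do
      osd=$(basename "$sock" .asok | sed 's/^ceph-//')   # -> osd.NN
      echo -n "$osd "
      ceph daemon "$osd" perf dump | grep numpg_stray
  done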
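
If you want to experiment with bluefs_buffered_io, something along
these lines should do it (again a sketch, not a recipe; I believe on
14.2.15 the OSDs need a restart to pick up the change, so verify the
running value afterwards):

  # set the option for all OSDs
  ceph config set osd bluefs_buffered_io true

  # restart OSDs one at a time / per failure domain, then confirm
  systemctl restart ceph-osd@NN
  ceph daemon osd.NN config get bluefs_buffered_io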
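
And for rocksdb compaction, the online and offline variants I'm aware
of (the offline path assumes the default /var/lib/ceph layout - treat
this as a sketch and adapt it to your deployment):

  # online, via the admin socket on the OSD host
  ceph daemon osd.NN compact

  # offline, with the OSD stopped
  systemctl stop ceph-osd@NN
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-NN compact
  systemctl start ceph-osd@NN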