Hello Josh,

just wanted to confirm that setting bluefs_buffered_io immediately
helped to hotfix the problem. I've also updated to 14.2.22, and we'll
discuss adding more NVMe devices so we can move the OSD databases off
the spinners and prevent further occurrences.

Thanks a lot for your time!

With best regards,

Nikola Ciprich

On Wed, Nov 03, 2021 at 09:11:20AM -0600, Josh Baergen wrote:
> Hi Nikola,
>
> > yes, some nodes have stray pgs (1..5), shall I do something about those?
>
> No need to do anything - Ceph will clean those up itself (and is doing
> so right now). I just wanted to confirm my hunch.
>
> Enabling buffered I/O should have an immediate effect on the read rate
> to your disks. I would recommend upgrading to 14.2.17+, though, as the
> improvements to PG cleaning are pretty substantial.
>
> Josh
>
> On Wed, Nov 3, 2021 at 8:13 AM Nikola Ciprich
> <nikola.ciprich@xxxxxxxxxxx> wrote:
> >
> > Hello Josh,
> >
> > > Was there PG movement (backfill) happening in this cluster recently?
> > > Do the OSDs have stray PGs (e.g. 'ceph daemon osd.NN perf dump | grep
> > > numpg_stray' - run this against an affected OSD from the housing
> > > node)?
> >
> > yes, some nodes have stray pgs (1..5), shall I do something about those?
> >
> > > I'm wondering if you're running into
> > > https://tracker.ceph.com/issues/45765, where cleaning of PGs from OSDs
> >
> > hmm, yes, this seems very familiar; the problems started when we began
> > using the balancer - forgot to mention that!
> >
> > > leads to a high read rate from disk due to a combination of rocksdb
> > > behaviour and caching issues. Turning on bluefs_buffered_io (on by
> > > default in 14.2.22) is a mitigation for this problem, but has some
> > > side effects to watch out for (write IOPS amplification, for one).
> > > Fixes for that linked issue went into 14.2.17, 14.2.22, and then
> > > Pacific; we found the 14.2.17 changes to be quite effective by
> > > themselves.
> > >
> > > Even if you don't have stray PGs, trying bluefs_buffered_io might be
> > > an interesting experiment. An alternative would be to compact rocksdb
> > > for each of your OSDs and see if that helps; compacting eliminates the
> > > tombstoned data that can cause problems during iteration, but if you
> > > have a workload that generates a lot of rocksdb tombstones (like PG
> > > cleaning does), then the problem will return a while after compaction.
> >
> > hmm, I'll try enabling bluefs_buffered_io (it was indeed false) and do
> > the compaction as well anyway.
> >
> > I'll report back, thanks for the hints!
> >
> > BR
> >
> > nik

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
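
For anyone else hitting this, the steps discussed in this thread boil
down to roughly the following. This is only a sketch - osd.NN and the
OSD data path are placeholders, bluefs_buffered_io may require an OSD
restart to take effect on older releases, and the offline compaction
variant requires the OSD to be stopped first; check the documentation
for your release before running any of it:

    # check whether an OSD is still holding stray PGs
    ceph daemon osd.NN perf dump | grep numpg_stray

    # enable buffered BlueFS I/O (the default since 14.2.22)
    ceph config set osd bluefs_buffered_io true
    systemctl restart ceph-osd@NN     # on systemd-based deployments

    # compact rocksdb online via the admin socket ...
    ceph daemon osd.NN compact

    # ... or offline, with the OSD stopped
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-NN compact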