Hello Josh,

just wanted to confirm that setting bluefs_buffered_io immediately
helped to hotfix the problem. I've also updated to 14.2.22, and we'll
discuss adding more NVMe devices so we can move the OSD databases off
the spinners and prevent further occurrences.

Thanks a lot for your time!

With best regards,

Nikola Ciprich

On Wed, Nov 03, 2021 at 09:11:20AM -0600, Josh Baergen wrote:
> Hi Nikola,
>
> > yes, some nodes have stray pgs (1..5), shall I do something about those?
>
> No need to do anything - Ceph will clean those up itself (and is doing
> so right now). I just wanted to confirm my hunch.
>
> Enabling buffered I/O should have an immediate effect on the read rate
> to your disks. I would recommend upgrading to 14.2.17+, though, as the
> improvements to PG cleaning are pretty substantial.
>
> Josh
>
> On Wed, Nov 3, 2021 at 8:13 AM Nikola Ciprich
> <nikola.ciprich@xxxxxxxxxxx> wrote:
> >
> > Hello Josh,
> >
> > > Was there PG movement (backfill) happening in this cluster recently?
> > > Do the OSDs have stray PGs (e.g. 'ceph daemon osd.NN perf dump | grep
> > > numpg_stray' - run this against an affected OSD from the housing
> > > node)?
> >
> > yes, some nodes have stray pgs (1..5), shall I do something about those?
> >
> > > I'm wondering if you're running into
> > > https://tracker.ceph.com/issues/45765, where cleaning of PGs from OSDs
> >
> > hmm, yes, this seems very familiar; the problems started when we began
> > using the balancer - forgot to mention that!
> >
> > > leads to a high read rate from disk due to a combination of rocksdb
> > > behaviour and caching issues. Turning on bluefs_buffered_io (on by
> > > default in 14.2.22) is a mitigation for this problem, but has some
> > > side effects to watch out for (write IOPS amplification, for one).
> > > Fixes for that linked issue went into 14.2.17, 14.2.22, and then
> > > Pacific; we found the 14.2.17 changes to be quite effective by
> > > themselves.
> > >
> > > Even if you don't have stray PGs, trying bluefs_buffered_io might be
> > > an interesting experiment. An alternative would be to compact rocksdb
> > > for each of your OSDs and see if that helps; compacting eliminates the
> > > tombstoned data that can cause problems during iteration, but if you
> > > have a workload that generates a lot of rocksdb tombstones (like PG
> > > cleaning does), then the problem will return a while after compaction.
> >
> > hmm, I'll try enabling bluefs_buffered_io (it was indeed false) and do
> > the compaction as well anyway.
> >
> > I'll report back, thanks for the hints!
> >
> > BR
> >
> > nik

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
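
For anyone else hitting this, the steps discussed in this thread boil
down to roughly the following. This is only a sketch - osd.NN and the
OSD data path are placeholders, bluefs_buffered_io may require an OSD
restart to take effect on older releases, and the offline compaction
variant requires the OSD to be stopped first; check the documentation
for your release before running any of it:

    # check whether an OSD is still holding stray PGs
    ceph daemon osd.NN perf dump | grep numpg_stray

    # enable buffered BlueFS I/O (the default since 14.2.22)
    ceph config set osd bluefs_buffered_io true
    systemctl restart ceph-osd@NN     # on systemd-based deployments

    # compact rocksdb online via the admin socket ...
    ceph daemon osd.NN compact

    # ... or offline, with the OSD stopped
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-NN compact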