Have you tried setting osd op queue cut off to high?

Peter

> On 11.08.2021, at 15:24, Frank Schilder <frans@xxxxxx> wrote:
>
> The recovery_sleep options are the next choice to look at. Increase them
> and clients will get more I/O time slots. However, with your settings I'm
> surprised clients are impacted at all. I usually leave the op priority at
> its default and use osd-max-backfills=2..4 for HDDs. With this, clients
> usually don't notice anything. I'm running mimic 13.2.10, though.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Nico Schottelius <nico.schottelius@xxxxxxxxxxx>
> Sent: 11 August 2021 10:08:34
> To: Ceph Users
> Subject: Very slow I/O during rebalance - options to tune?
>
> Good morning,
>
> after removing 3 OSDs which had been dead for some time, rebalancing
> started this morning and is making client I/O really slow (in the
> 10-30 MB/s range!). Rebalancing started at 1.2-1.6 GB/s; after issuing
>
> ceph tell 'osd.*' injectargs --osd-max-backfills=1 --osd-recovery-max-active=1 --osd-recovery-op-priority=1
>
> the rebalance came down to ~800 MB/s.
>
> The default osd-recovery-op-priority is 2 in our clusters, so already
> way below the client priority.
>
> At times (see the ceph -s output below) particular OSDs are shown as
> slow, but that is not confined to one host or one OSD; it seems to move
> around the cluster.
>
> Are there any other ways to prioritize client traffic over rebalancing?
>
> We don't want to stop the rebalance completely, but it seems that even
> with the above settings client I/O is sacrificed almost completely.
>
> Our cluster version is 14.2.16.
>
> Best regards,
>
> Nico
>
> --------------------------------------------------------------------------------
>
>   cluster:
>     id:     1ccd84f6-e362-4c50-9ffe-59436745e445
>     health: HEALTH_WARN
>             5 slow ops, oldest one blocked for 155 sec, daemons [osd.12,osd.41] have slow ops.
>
>   services:
>     mon: 5 daemons, quorum server2,server8,server6,server4,server18 (age 4w)
>     mgr: server2(active, since 4w), standbys: server4, server6, server8, server18
>     osd: 104 osds: 104 up (since 46m), 104 in (since 4w); 365 remapped pgs
>
>   data:
>     pools:   4 pools, 2624 pgs
>     objects: 47.67M objects, 181 TiB
>     usage:   550 TiB used, 215 TiB / 765 TiB avail
>     pgs:     6034480/142997898 objects misplaced (4.220%)
>              2259 active+clean
>              315  active+remapped+backfill_wait
>              50   active+remapped+backfilling
>
>   io:
>     client:   15 MiB/s rd, 26 MiB/s wr, 559 op/s rd, 617 op/s wr
>     recovery: 782 MiB/s, 196 objects/s
>
> ... and a little later:
>
> [10:06:32] server6.place6:~# ceph -s
>   cluster:
>     id:     1ccd84f6-e362-4c50-9ffe-59436745e445
>     health: HEALTH_OK
>
>   services:
>     mon: 5 daemons, quorum server2,server8,server6,server4,server18 (age 4w)
>     mgr: server2(active, since 4w), standbys: server4, server6, server8, server18
>     osd: 104 osds: 104 up (since 59m), 104 in (since 4w); 349 remapped pgs
>
>   data:
>     pools:   4 pools, 2624 pgs
>     objects: 47.67M objects, 181 TiB
>     usage:   550 TiB used, 214 TiB / 765 TiB avail
>     pgs:     5876676/143004876 objects misplaced (4.109%)
>              2275 active+clean
>              303  active+remapped+backfill_wait
>              46   active+remapped+backfilling
>
>   io:
>     client:   3.6 MiB/s rd, 25 MiB/s wr, 704 op/s rd, 726 op/s wr
>     recovery: 776 MiB/s, 0 keys/s, 195 objects/s
>
> --
> Sustainable and modern Infrastructures by ungleich.ch
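
A rough sketch of how the knobs discussed above could be inspected and
adjusted on a 14.x cluster. The values are examples only, not
recommendations, osd.12 is just a placeholder for any OSD, and as far as I
know osd_op_queue_cut_off is only read when an OSD starts, so plan for a
restart after changing it:

    # what is an OSD currently running with? (run on that OSD's host,
    # via the admin socket)
    ceph daemon osd.12 config get osd_op_queue_cut_off
    ceph daemon osd.12 config get osd_max_backfills
    ceph daemon osd.12 config get osd_recovery_sleep_hdd

    # prefer client ops over recovery/backfill ops in the op queue;
    # stored in the mon config database, picked up by OSDs on restart
    ceph config set osd osd_op_queue_cut_off high

    # throttle recovery further at runtime: keep backfills low and add
    # sleep between recovery ops on HDDs (example values)
    ceph tell 'osd.*' injectargs '--osd_max_backfills=1 --osd_recovery_sleep_hdd=0.2'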