Ceph 17.2.5, dockerized, Ubuntu 20.04, OSDs on HDD with WAL/DB on SSD.

Hi all,

old topic, but the problem still exists. I tested it extensively, with
osd_op_queue set either to mclock_scheduler (with the profile set to
high_recovery_ops) or to wpq with the well-known options (sleep_time,
max_backfill) from
https://docs.ceph.com/en/quincy/rados/configuration/osd-config-ref/

When removing an OSD with `ceph orch osd rm X`, the backfilling always
ends with a large number of misplaced objects at a low recovery rate
(right now "120979/336643536 objects misplaced (0.036%); 10 KiB/s, 2
objects/s recovering"). The rate drops significantly when there are only
very few PGs involved. I wonder whether anyone with a similar
installation to ours (see above) does not experience this problem.

Thanks,
Erich

On Mon, 12 Dec 2022 at 12:28, Frank Schilder <frans@xxxxxx> wrote:

> Hi Monish,
>
> you are probably on the mclock scheduler, which ignores these settings.
> You might want to set them back to defaults, change the scheduler to
> wpq and then try again to see if anything needs adjusting. There were
> several threads about "broken" recovery op scheduling with mclock in
> the latest versions.
>
> So, back to Eugen's answer: go through this list and try the solutions
> from earlier cases.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Monish Selvaraj <monish@xxxxxxxxxxxxxxx>
> Sent: 12 December 2022 11:32:26
> To: Eugen Block
> Cc: ceph-users@xxxxxxx
> Subject: Re: Increase the recovery throughput
>
> Hi Eugen,
>
> We tried that already. osd_max_backfills is at 24 and
> osd_recovery_max_active is at 20.
>
> On Mon, Dec 12, 2022 at 3:47 PM Eugen Block <eblock@xxxxxx> wrote:
>
> > Hi,
> >
> > there are many threads discussing recovery throughput, have you tried
> > any of the solutions? First thing to try is to increase
> > osd_recovery_max_active and osd_max_backfills. What are the current
> > values in your cluster?
> >
> > Zitat von Monish Selvaraj <monish@xxxxxxxxxxxxxxx>:
> >
> > > Hi,
> > >
> > > Our ceph cluster consists of 20 hosts and 240 osds.
> > >
> > > We used an erasure-coded pool with the cache-pool concept.
> > >
> > > Some time back, 2 hosts went down and the PGs went into a degraded
> > > state. We got the 2 hosts back up after some time. After that the
> > > PGs started recovering, but it takes a long time (months). While
> > > this was happening, the cluster held 664.4 M objects and 987 TB of
> > > data. The recovery status has not changed; it remains at 88 pgs
> > > degraded.
> > >
> > > During this period, we increased the pg count from 256 to 512 for
> > > the data-pool (erasure-coded pool).
> > >
> > > We also observed (over one week) the recovery to be very slow; the
> > > current recovery rate is around 750 MiB/s.
> > >
> > > Is there any way to increase this recovery throughput?
> > >
> > > *Ceph-version: quincy*
> > >
> > > [image: image.png]
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
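For readers landing on this thread: the knobs discussed above can be inspected and adjusted roughly as follows. This is a sketch for a Quincy cluster; the specific values are illustrative, not recommendations, and whether mclock or wpq behaves better depends on the cluster.

```shell
# Which scheduler are the OSDs using? (mclock_scheduler is the Quincy
# default; wpq is the older queue that honors the sleep/backfill knobs.)
ceph config get osd osd_op_queue

# Option 1: stay on mclock and shift capacity toward recovery.
ceph config set osd osd_mclock_profile high_recovery_ops

# Option 2: switch to wpq (restart the OSDs afterwards) and tune the
# classic recovery options instead.
ceph config set osd osd_op_queue wpq
ceph config set osd osd_recovery_sleep_hdd 0.0  # illustrative value
ceph config set osd osd_max_backfills 4         # illustrative value
ceph config set osd osd_recovery_max_active 8   # illustrative value

# Verify what a given OSD actually runs with.
ceph config show osd.0 | grep -E 'osd_op_queue|backfill|recovery_max'
```

Note that under mclock the sleep/backfill options are ignored, which is why changing them appears to have no effect until the scheduler is switched to wpq.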