On Tue, Mar 21, 2023 at 2:21 PM Clyso GmbH - Ceph Foundation Member <
joachim.kraftmayer@xxxxxxxxx> wrote:

>
> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue
>

Since this requires a restart I went another way to speed up the recovery
of degraded PGs and avoid weirdness while restarting the OSDs. I've
increased the value of osd_mclock_max_capacity_iops_hdd to a ridiculous
number for spinning disks (6000). The effect is not magical, but the
recovery went from 4 to 60 objects/s. Ceph should be back to normal in a
few hours.

I will change the osd_op_queue value once the cluster is stable.

Thanks for the help, it's been really useful, and I know a little bit more
about Ceph :)

Gauvain

> ___________________________________
> Clyso GmbH - Ceph Foundation Member
>
> On 21.03.23 at 12:51, Gauvain Pocentek wrote:
>
> (adding back the list)
>
> On Tue, Mar 21, 2023 at 11:25 AM Joachim Kraftmayer <
> joachim.kraftmayer@xxxxxxxxx> wrote:
>
>> I added the questions and answers below.
>>
>> ___________________________________
>> Best Regards,
>> Joachim Kraftmayer
>> CEO | Clyso GmbH
>>
>> Clyso GmbH
>> p: +49 89 21 55 23 91 2
>> a: Loristraße 8 | 80335 München | Germany
>> w: https://clyso.com | e: joachim.kraftmayer@xxxxxxxxx
>>
>> We are hiring: https://www.clyso.com/jobs/
>> ---
>> CEO: Dipl. Inf. (FH) Joachim Kraftmayer
>> Unternehmenssitz: Utting am Ammersee
>> Handelsregister beim Amtsgericht: Augsburg
>> Handelsregister-Nummer: HRB 25866
>> USt. ID-Nr.: DE275430677
>>
>> On 21.03.23 at 11:14, Gauvain Pocentek wrote:
>>
>> Hi Joachim,
>>
>> On Tue, Mar 21, 2023 at 10:13 AM Joachim Kraftmayer <
>> joachim.kraftmayer@xxxxxxxxx> wrote:
>>
>>> Which Ceph version are you running, and is mclock active?
>>>
>> We're using Quincy (17.2.5), upgraded step by step from Luminous if I
>> remember correctly.
>>
>> Did you recreate the OSDs? If yes, at which version?
>>
> I actually don't remember all the history, but I think we added the HDD
> nodes while running Pacific.
>
>> mclock seems active, set to the high_client_ops profile. HDD OSDs have
>> very different settings for max capacity IOPS:
>>
>> osd.137  basic  osd_mclock_max_capacity_iops_hdd   929.763899
>> osd.161  basic  osd_mclock_max_capacity_iops_hdd  4754.250946
>> osd.222  basic  osd_mclock_max_capacity_iops_hdd   540.016984
>> osd.281  basic  osd_mclock_max_capacity_iops_hdd  1029.193945
>> osd.282  basic  osd_mclock_max_capacity_iops_hdd  1061.762870
>> osd.283  basic  osd_mclock_max_capacity_iops_hdd   462.984562
>>
>> We haven't set those explicitly, could they be the reason for the slow
>> recovery?
>>
>> I recommend disabling mclock for now, and yes, we have seen slow recovery
>> caused by mclock.
>>
> Stupid question: how do you do that? I've looked through the docs but
> could only find information about changing the settings.
>
>> Bonus question: does Ceph set that itself?
>>
>> Yes, and if you have a setup with HDD + SSD (DB & WAL), the discovery does
>> not work in the right way.
>>
> Good to know!
>
> Gauvain
>
>> Thanks!
>>
>> Gauvain
>>
>>> Joachim
>>>
>>> ___________________________________
>>> Clyso GmbH - Ceph Foundation Member
>>>
>>> On 21.03.23 at 06:53, Gauvain Pocentek wrote:
>>> > Hello all,
>>> >
>>> > We have an EC (4+2) pool for RGW data, with HDDs + SSDs for WAL/DB.
>>> > This pool has 9 servers, each with 12 disks of 16 TB. About 10 days ago
>>> > we lost a server and we've removed its OSDs from the cluster.
>>> > Ceph has started to remap and backfill as expected, but the process has
>>> > been getting slower and slower. Today the recovery rate is around
>>> > 12 MiB/s and 10 objects/s. All the remaining unclean PGs are backfilling:
>>> >
>>> >   data:
>>> >     volumes: 1/1 healthy
>>> >     pools:   14 pools, 14497 pgs
>>> >     objects: 192.38M objects, 380 TiB
>>> >     usage:   764 TiB used, 1.3 PiB / 2.1 PiB avail
>>> >     pgs:     771559/1065561630 objects degraded (0.072%)
>>> >              1215899/1065561630 objects misplaced (0.114%)
>>> >              14428 active+clean
>>> >              50    active+undersized+degraded+remapped+backfilling
>>> >              18    active+remapped+backfilling
>>> >              1     active+clean+scrubbing+deep
>>> >
>>> > We've checked the health of the remaining servers, and everything looks
>>> > fine (CPU/RAM/network/disks).
>>> >
>>> > Any hints on what could be happening?
>>> >
>>> > Thank you,
>>> > Gauvain
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
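
A minimal sketch of the commands discussed in this thread, assuming a Quincy
(17.2.x) cluster; the OSD id, the 6000 IOPS figure and the cephadm restart
step are illustrative examples taken from or implied by the thread, not
recommended values:

  # Inspect the per-OSD capacity values that mclock stored automatically
  ceph config dump | grep osd_mclock_max_capacity_iops_hdd

  # Override the capacity for a single OSD (as done above), or for all OSDs at once
  ceph config set osd.137 osd_mclock_max_capacity_iops_hdd 6000
  ceph config set osd osd_mclock_max_capacity_iops_hdd 6000

  # To stop using mclock, switch the scheduler back to wpq; this only takes
  # effect once the OSDs have been restarted
  ceph config set osd osd_op_queue wpq
  ceph orch daemon restart osd.137   # assumes cephadm; otherwise restart the OSD services per host

A per-OSD override can later be removed again with
"ceph config rm osd.<id> osd_mclock_max_capacity_iops_hdd".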