I recently got mclock going literally an order of magnitude faster. I would love to claim I found all the options myself, but I collected the knowledge of what knobs I needed to turn from here.

Steps I took:

- Cleared all OSD-specific osd_mclock_max_capacity_iops settings. The auto-created ones were all over the place: some rust drives claimed 200 and others well over 5000.
- Set sane global osd_mclock_max_capacity_iops_hdd and osd_mclock_max_capacity_iops_ssd numbers based on my lowest-performing drives of each class in my environment (your numbers will be different; these are for 18T SAS Seagate rust drives and Micron 9100 6.4T NVMe):
  - osd basic osd_mclock_max_capacity_iops_hdd 375.000000
  - osd basic osd_mclock_max_capacity_iops_ssd 575000.000000
- Set the profile to what I wanted my global default to be:
  - osd advanced osd_mclock_profile high_client_ops
- Tweaked the costs of doing operations:
  - osd dev osd_mclock_cost_per_byte_usec_hdd 1.000000
  - osd dev osd_mclock_cost_per_byte_usec_ssd 0.005000

I need to revisit the cost-per-byte settings. Originally I was using just this knob to play with speeds, but I quickly started getting many slow ops along with the faster speeds. Then I pulled the max capacity IOPS down from 400 and finally settled where I am now. I have room for improvement here, but this is my prod cluster so... yeah.

- Next I set specific faster drives to their own max capacity IOPS (Optane drives I have for the metadata tier):
  - e.g. osd.450 basic osd_mclock_max_capacity_iops_ssd 785000.000000
- I also set the profile on specific drives in a tier I'm migrating to new spinners to "balanced" to speed that up:
  - e.g. osd.789 advanced osd_mclock_profile balanced

I think that's about it. I was not scientific AT ALL with this. I just kept turning knobs a little and watching the recovery throughput and the healthometer. On my cold EC tier rebalance I went from something like 150 MB/s and 20 obj/s to 2.1 GB/s and 750 obj/s.

I know I'm pushing these drives pretty hard because I'm watching different drives report slow ops for N seconds and then clear a few minutes later. My replicated tier now recovers ridiculously fast as well.

I'm looking forward to pulling all of this out and having Ceph DoTheRightThing(tm) with recovery speeds. We shall see.

-paul

--
Paul Mezzanini
Platform Engineer III
Research Computing
Rochester Institute of Technology
“End users is a description, not a goal.”
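(For anyone wanting to try the same thing: the steps above map roughly onto the standard "ceph config" commands as sketched below. The OSD ids and numbers are just the example values from this message, so adjust them for your own hardware, and repeat the per-OSD removals for every OSD that has an auto-created value -- "ceph config dump | grep osd_mclock" will show which ones do. Treat this as a sketch, not an exact command history.)

    # clear the auto-measured per-OSD values (repeat for every OSD that has one)
    ceph config rm osd.0 osd_mclock_max_capacity_iops_hdd
    ceph config rm osd.0 osd_mclock_max_capacity_iops_ssd

    # global per-device-class IOPS caps
    ceph config set osd osd_mclock_max_capacity_iops_hdd 375
    ceph config set osd osd_mclock_max_capacity_iops_ssd 575000

    # global default profile
    ceph config set osd osd_mclock_profile high_client_ops

    # operation cost tweaks (dev-level options)
    ceph config set osd osd_mclock_cost_per_byte_usec_hdd 1
    ceph config set osd osd_mclock_cost_per_byte_usec_ssd 0.005

    # per-OSD overrides for faster drives / a tier being migrated
    ceph config set osd.450 osd_mclock_max_capacity_iops_ssd 785000
    ceph config set osd.789 osd_mclock_profile balanced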
________________________________________
From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
Sent: Thursday, July 6, 2023 6:04 PM
To: Jesper Krogh
Cc: ceph-users@xxxxxxx
Subject: Re: Cannot get backfill speed up

Hi Jesper,

Indeed, many users have reported slow backfilling and recovery with the mclock scheduler. This is supposed to be fixed in the latest Quincy, but clearly something is still slowing things down.

Some clusters have better luck reverting to osd_op_queue = wpq. (I'm hoping that by proposing this, someone who tuned mclock recently will chime in with better advice.)

Cheers, Dan

______________________________________________________
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com


On Wed, Jul 5, 2023 at 10:28 PM Jesper Krogh <jesper@xxxxxxxx> wrote:
>
> Hi.
>
> Fresh cluster - but despite setting:
>
> jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep recovery_max_active_ssd
> osd_recovery_max_active_ssd    50     mon    default[20]
>
> jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep osd_max_backfills
> osd_max_backfills              100    mon    default[10]
>
> I still get:
>
> jskr@dkcphhpcmgt028:/$ sudo ceph status
>   cluster:
>     id:     5c384430-da91-11ed-af9c-c780a5227aff
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 (age 16h)
>     mgr: dkcphhpcmgt031.afbgjx(active, since 33h), standbys: dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
>     mds: 2/2 daemons up, 1 standby
>     osd: 40 osds: 40 up (since 45h), 40 in (since 39h); 21 remapped pgs
>
>   data:
>     volumes: 2/2 healthy
>     pools:   9 pools, 495 pgs
>     objects: 24.85M objects, 60 TiB
>     usage:   117 TiB used, 159 TiB / 276 TiB avail
>     pgs:     10655690/145764002 objects misplaced (7.310%)
>              474 active+clean
>              15  active+remapped+backfilling
>              6   active+remapped+backfill_wait
>
>   io:
>     client:   0 B/s rd, 1.4 MiB/s wr, 0 op/s rd, 116 op/s wr
>     recovery: 328 MiB/s, 108 objects/s
>
>   progress:
>     Global Recovery Event (9h)
>       [==========================..] (remaining: 25m)
>
> With these numbers for the settings, I would expect to get more than 15 PGs actively backfilling... (and based on the SSDs and 2x25Gbit network, I can also spend more resources on recovery than 328 MiB/s).
>
> Thanks,
>
> --
> Jesper Krogh
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx