I recently got mclock going literally an order of magnitude faster. I would love to claim I found all the options myself, but I collected the knowledge of what knobs I needed to turn from here.

Steps I took:

- Cleared all OSD-specific osd_mclock_max_capacity_iops settings. The auto-created ones were all over the place: some rust drives claimed 200 and others well over 5000.
- Set sane global osd_mclock_max_capacity_iops_hdd and osd_mclock_max_capacity_iops_ssd numbers based on my lowest-performing drives of each class in my environment (your numbers will be different; these are for 18T SAS Seagate rust drives and Micron 9100 6.4T NVMe):
  - osd basic osd_mclock_max_capacity_iops_hdd 375.000000
  - osd basic osd_mclock_max_capacity_iops_ssd 575000.000000
- Set the profile to what I wanted my global default to be:
  - osd advanced osd_mclock_profile high_client_ops
- Tweaked the costs of doing operations:
  - osd dev osd_mclock_cost_per_byte_usec_hdd 1.000000
  - osd dev osd_mclock_cost_per_byte_usec_ssd 0.005000

I need to revisit the cost-per-byte settings. Originally I was using just this knob to play with speeds, but I quickly started getting many slow ops along with the faster speeds. Then I pulled the max capacity IOPS down from 400 and finally settled where I am now. I have room for improvement here, but this is my prod cluster so... yeah.

- Next I set specific faster drives to their own max capacity IOPS (Optane drives I have for the metadata tier):
  - e.g. osd.450 basic osd_mclock_max_capacity_iops_ssd 785000.000000
- I also set the profile on specific drives in a tier I'm migrating to new spinners to "balanced" to speed that up:
  - e.g. osd.789 advanced osd_mclock_profile balanced

I think that's about it. I was not scientific AT ALL with this. I just kept turning knobs a little and watching the recovery throughput and the healthometer. On my cold EC tier rebalance I went from something like 150 MB/s and 20 obj/s to 2.1 GB/s and 750 obj/s.

I know I'm pushing these drives pretty hard because I'm watching different drives report slow ops for N seconds and then clear a few minutes later. My replicated tier now recovers ridiculously fast as well.

I'm looking forward to pulling all of this out and having Ceph DoTheRightThing(tm) with recovery speeds. We shall see.

-paul

--
Paul Mezzanini
Platform Engineer III
Research Computing
Rochester Institute of Technology
“End users is a description, not a goal.”
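(For anyone wanting to try the same thing: the steps above map roughly onto the standard "ceph config" commands as sketched below. The OSD ids and numbers are just the example values from this message, so adjust them for your own hardware, and repeat the per-OSD removals for every OSD that has an auto-created value -- "ceph config dump | grep osd_mclock" will show which ones do. Treat this as a sketch, not an exact command history.)

    # clear the auto-measured per-OSD values (repeat for every OSD that has one)
    ceph config rm osd.0 osd_mclock_max_capacity_iops_hdd
    ceph config rm osd.0 osd_mclock_max_capacity_iops_ssd

    # global per-device-class IOPS caps
    ceph config set osd osd_mclock_max_capacity_iops_hdd 375
    ceph config set osd osd_mclock_max_capacity_iops_ssd 575000

    # global default profile
    ceph config set osd osd_mclock_profile high_client_ops

    # operation cost tweaks (dev-level options)
    ceph config set osd osd_mclock_cost_per_byte_usec_hdd 1
    ceph config set osd osd_mclock_cost_per_byte_usec_ssd 0.005

    # per-OSD overrides for faster drives / a tier being migrated
    ceph config set osd.450 osd_mclock_max_capacity_iops_ssd 785000
    ceph config set osd.789 osd_mclock_profile balanced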
________________________________________
From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
Sent: Thursday, July 6, 2023 6:04 PM
To: Jesper Krogh
Cc: ceph-users@xxxxxxx
Subject: Re: Cannot get backfill speed up

Hi Jesper,

Indeed, many users have reported slow backfilling and recovery with the mclock scheduler. This is supposed to be fixed in the latest Quincy, but clearly something is still slowing things down.

Some clusters have better luck reverting to osd_op_queue = wpq. (I'm hoping that by proposing this, someone who tuned mclock recently will chime in with better advice.)

Cheers, Dan

______________________________________________________
Clyso GmbH | Ceph Support and Consulting | https://www.clyso.com


On Wed, Jul 5, 2023 at 10:28 PM Jesper Krogh <jesper@xxxxxxxx> wrote:
>
> Hi.
>
> Fresh cluster - but despite setting:
>
> jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep recovery_max_active_ssd
> osd_recovery_max_active_ssd    50     mon    default[20]
>
> jskr@dkcphhpcmgt028:/$ sudo ceph config show osd.0 | grep osd_max_backfills
> osd_max_backfills              100    mon    default[10]
>
> I still get:
>
> jskr@dkcphhpcmgt028:/$ sudo ceph status
>   cluster:
>     id:     5c384430-da91-11ed-af9c-c780a5227aff
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum dkcphhpcmgt031,dkcphhpcmgt029,dkcphhpcmgt028 (age 16h)
>     mgr: dkcphhpcmgt031.afbgjx(active, since 33h), standbys: dkcphhpcmgt029.bnsegi, dkcphhpcmgt028.bxxkqd
>     mds: 2/2 daemons up, 1 standby
>     osd: 40 osds: 40 up (since 45h), 40 in (since 39h); 21 remapped pgs
>
>   data:
>     volumes: 2/2 healthy
>     pools:   9 pools, 495 pgs
>     objects: 24.85M objects, 60 TiB
>     usage:   117 TiB used, 159 TiB / 276 TiB avail
>     pgs:     10655690/145764002 objects misplaced (7.310%)
>              474 active+clean
>              15  active+remapped+backfilling
>              6   active+remapped+backfill_wait
>
>   io:
>     client:   0 B/s rd, 1.4 MiB/s wr, 0 op/s rd, 116 op/s wr
>     recovery: 328 MiB/s, 108 objects/s
>
>   progress:
>     Global Recovery Event (9h)
>       [==========================..] (remaining: 25m)
>
> With these numbers for the settings, I would expect to get more than 15 PGs actively backfilling... (and based on the SSDs and 2x25Gbit network, I can also spend more resources on recovery than 328 MiB/s).
>
> Thanks,
>
> --
> Jesper Krogh
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx