I saw a major boost after setting sleep_hdd to 0. Only after that did I
start staying at around 500 MiB/s to 1.2 GiB/s and 1.5k to 2.5k obj/s.
Eventually it tapered back down, but for me sleep was the key, and
specifically in my case: osd_recovery_sleep_hdd

On Mon, Jun 27, 2022 at 11:17 AM Curt <lightspd@xxxxxxxxx> wrote:

> On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder <frans@xxxxxx> wrote:
>
> > I think this is just how ceph is. Maybe you should post the output
> > of "ceph status", "ceph osd pool stats" and "ceph df" so that we can
> > get an idea whether what you are looking at is expected or not. As I
> > wrote before, object recovery is throttled and the recovery
> > bandwidth depends heavily on object size. The interesting question
> > is how many objects per second are recovered/rebalanced.
>
>   data:
>     pools:   11 pools, 369 pgs
>     objects: 2.45M objects, 9.2 TiB
>     usage:   20 TiB used, 60 TiB / 80 TiB avail
>     pgs:     512136/9729081 objects misplaced (5.264%)
>              343 active+clean
>              22  active+remapped+backfilling
>
>   io:
>     client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
>     recovery: 34 MiB/s, 8 objects/s
>
> Pool 12 is the only one with any stats.
>
> pool EC-22-Pool id 12
>   510048/9545052 objects misplaced (5.344%)
>   recovery io 36 MiB/s, 9 objects/s
>   client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr
>
> --- RAW STORAGE ---
> CLASS  SIZE    AVAIL   USED    RAW USED  %RAW USED
> hdd    80 TiB  60 TiB  20 TiB  20 TiB    25.45
> TOTAL  80 TiB  60 TiB  20 TiB  20 TiB    25.45
>
> --- POOLS ---
> POOL                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> .mgr                         1    1  152 MiB       38  457 MiB      0    9.2 TiB
> 21BadPool                    3   32    8 KiB        1   12 KiB      0     18 TiB
> .rgw.root                    4   32  1.3 KiB        4   48 KiB      0    9.2 TiB
> default.rgw.log              5   32  3.6 KiB      209  408 KiB      0    9.2 TiB
> default.rgw.control          6   32      0 B        8      0 B      0    9.2 TiB
> default.rgw.meta             7    8  6.7 KiB       20  203 KiB      0    9.2 TiB
> rbd_rep_pool                 8   32  2.0 MiB        5  5.9 MiB      0    9.2 TiB
> default.rgw.buckets.index    9    8  2.0 MiB       33  5.9 MiB      0    9.2 TiB
> default.rgw.buckets.non-ec  10   32  1.4 KiB        0  4.3 KiB      0    9.2 TiB
> default.rgw.buckets.data    11   32  232 GiB   61.02k  697 GiB   2.41    9.2 TiB
> EC-22-Pool                  12  128  9.8 TiB    2.39M   20 TiB  41.55     14 TiB
>
> > Maybe provide the output of the first two commands for
> > osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each
> > (wait a bit after setting these and then collect the output).
> > Include the applied values for osd_max_backfills* and
> > osd_recovery_max_active* for one of the OSDs in the pool (ceph
> > config show osd.ID | grep -e osd_max_backfills -e
> > osd_recovery_max_active).
>
> I didn't notice any speed difference with the sleep values changed,
> but I'll grab the stats between changes when I have a chance.
>
> ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
> osd_max_backfills            1000  override  mon[5]
> osd_recovery_max_active      1000  override
> osd_recovery_max_active_hdd  1000  override  mon[5]
> osd_recovery_max_active_ssd  1000  override
>
> > I don't really know if on such a small cluster one can expect more
> > than what you see. It has nothing to do with network speed if you
> > have a 10G line. However, recovery is something completely different
> > from a full link-speed copy.
> >
> > I can tell you that boatloads of tiny objects are a huge pain for
> > recovery, even on SSD. Ceph doesn't raid up sections of disks
> > against each other, but object for object.
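>
> If I do that arithmetic on my numbers above, it seems to bear this
> out: EC-22-Pool holds 9.8 TiB across 2.39M objects, i.e. roughly 4 MiB
> per object on average, and at the ~9 objects/s shown above that comes
> to roughly 36 MiB/s, which is almost exactly the recovery rate I see.
> So the per-object throttling, not the 10G link, looks like the limit
> here.
>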
> > This might be a feature request: that PG space allocation and
> > recovery should follow the model of LVM extents (ideally match with
> > LVM extents) to allow recovery/rebalancing larger chunks of storage
> > in one go, containing parts of a large or many small objects.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Curt <lightspd@xxxxxxxxx>
> > Sent: 27 June 2022 17:35:19
> > To: Frank Schilder
> > Cc: ceph-users@xxxxxxx
> > Subject: Re: Re: Ceph recovery network speed
> >
> > Hello,
> >
> > I had already increased/changed those variables previously. I
> > increased pg_num to 128, which increased the number of PGs
> > backfilling, but speed is still only about 30 MiB/s on average and
> > it has been backfilling 23 PGs for the last several hours. Should I
> > increase it higher than 128?
> >
> > I'm still trying to figure out if this is just how ceph is or if
> > there is a bottleneck somewhere. For example, if I sftp a 10G file
> > between these servers it's done in a couple of minutes or less. Am I
> > thinking about this wrong?
> >
> > Thanks,
> > Curt
> >
> > On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Hi Curt,
> >
> > as far as I understood, a 2+2 EC pool is recovering, which makes 1
> > OSD per host busy. My experience is that the algorithm for selecting
> > PGs to backfill/recover is not very smart. It could simply be that
> > it doesn't find more PGs without violating some of these settings:
> >
> > osd_max_backfills
> > osd_recovery_max_active
> >
> > I have never observed the second parameter to change anything (try
> > it anyway). However, the first one has a large impact. You could try
> > increasing it slowly until recovery moves faster. Another parameter
> > you might want to try is
> >
> > osd_recovery_sleep_[hdd|ssd]
> >
> > Be careful, as this will impact client IO. I could reduce the sleep
> > for my HDDs to 0.05. With your workload pattern, this might be
> > something you can tune as well.
> >
> > Having said that, I think you should increase your PG count on the
> > EC pool as soon as the cluster is healthy. You have only about 20
> > PGs per OSD, and large PGs take unnecessarily long to recover. A
> > higher PG count will also make it easier for the scheduler to find
> > PGs for recovery/backfill. Aim for a number between 100 and 200.
> > Give the pool(s) with the most data (#objects) the most PGs.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Curt <lightspd@xxxxxxxxx>
> > Sent: 24 June 2022 19:04
> > To: Anthony D'Atri; ceph-users@xxxxxxx
> > Subject: Re: Ceph recovery network speed
> >
> > Two PGs shouldn't take hours to backfill, in my opinion. These are
> > just 2 TB enterprise HDDs.
> >
> > Take this log entry below: 72 minutes and still backfilling
> > undersized? Should it be that slow?
> >
> > pg 12.15 is stuck undersized for 72m, current state
> > active+undersized+degraded+remapped+backfilling, last acting
> > [34,10,29,NONE]
> >
> > Thanks,
> > Curt
> >
> > On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri
> > <anthony.datri@xxxxxxxxx> wrote:
> >
> > > Your recovery is slow *because* there are only 2 PGs backfilling.
> > >
> > > What kind of OSD media are you using?
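> > >
> > > (A quick way to double-check what Ceph detected, if it helps:
> > > "ceph osd tree" lists each OSD's device class, and something like
> > > "ceph osd metadata 34 | grep rotational" shows whether that OSD
> > > was registered as spinning media.)
> > >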
> > > > On Jun 24, 2022, at 09:46, Curt <lightspd@xxxxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I'm trying to understand why my recovery is so slow with only 2
> > > > PGs backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G
> > > > network. I have tested the speed between machines with a few
> > > > tools and all confirm 10G speed. I've tried changing various
> > > > settings of priority and recovery sleep hdd, but it's still the
> > > > same. Is this a configuration issue or something else?
> > > >
> > > > It's just a small cluster right now with 4 hosts and 11 OSDs per
> > > > host. Please let me know if you need more information.
> > > >
> > > > Thanks,
> > > > Curt
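
To close the loop on my note at the top: in my case the change amounted
to one setting, along the lines of

    ceph config set osd osd_recovery_sleep_hdd 0

while watching "ceph -s" and "ceph osd pool stats" for the effect on
the recovery rate. No promises it helps elsewhere; as Frank notes, the
sleep exists to protect client IO, so if clients start to suffer, step
it back up to something small like 0.05.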