I would love to see those types of speeds. I tried setting it all the
way to 0 and got nothing; I did that before I sent the first email,
maybe it was your old post I got it from.

osd_recovery_sleep_hdd  0.000000  override  (mon[0.000000])

On Mon, Jun 27, 2022 at 9:27 PM Robert Gallop <robert.gallop@xxxxxxxxx> wrote:

> I saw a major boost after having sleep_hdd set to 0. Only after that
> did I start staying at around 500 MiB/s to 1.2 GiB/s and 1.5k obj/sec
> to 2.5k obj/sec.
>
> Eventually it tapered back down, but for me sleep was the key, and
> specifically in my case:
>
> osd_recovery_sleep_hdd
>
> On Mon, Jun 27, 2022 at 11:17 AM Curt <lightspd@xxxxxxxxx> wrote:
>
>> On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder <frans@xxxxxx> wrote:
>>
>> > I think this is just how Ceph is. Maybe you should post the output
>> > of "ceph status", "ceph osd pool stats" and "ceph df" so that we
>> > can get an idea whether what you are looking at is expected or not.
>> > As I wrote before, object recovery is throttled and the recovery
>> > bandwidth depends heavily on object size. The interesting question
>> > is how many objects per second are recovered/rebalanced.
>>
>>   data:
>>     pools:   11 pools, 369 pgs
>>     objects: 2.45M objects, 9.2 TiB
>>     usage:   20 TiB used, 60 TiB / 80 TiB avail
>>     pgs:     512136/9729081 objects misplaced (5.264%)
>>              343 active+clean
>>              22  active+remapped+backfilling
>>
>>   io:
>>     client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
>>     recovery: 34 MiB/s, 8 objects/s
>>
>> Pool 12 is the only one with any stats.
>>
>>   pool EC-22-Pool id 12
>>     510048/9545052 objects misplaced (5.344%)
>>     recovery io 36 MiB/s, 9 objects/s
>>     client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr
>>
>> --- RAW STORAGE ---
>> CLASS  SIZE    AVAIL   USED    RAW USED  %RAW USED
>> hdd    80 TiB  60 TiB  20 TiB  20 TiB    25.45
>> TOTAL  80 TiB  60 TiB  20 TiB  20 TiB    25.45
>>
>> --- POOLS ---
>> POOL                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
>> .mgr                         1    1  152 MiB       38  457 MiB      0    9.2 TiB
>> 21BadPool                    3   32    8 KiB        1   12 KiB      0     18 TiB
>> .rgw.root                    4   32  1.3 KiB        4   48 KiB      0    9.2 TiB
>> default.rgw.log              5   32  3.6 KiB      209  408 KiB      0    9.2 TiB
>> default.rgw.control          6   32      0 B        8      0 B      0    9.2 TiB
>> default.rgw.meta             7    8  6.7 KiB       20  203 KiB      0    9.2 TiB
>> rbd_rep_pool                 8   32  2.0 MiB        5  5.9 MiB      0    9.2 TiB
>> default.rgw.buckets.index    9    8  2.0 MiB       33  5.9 MiB      0    9.2 TiB
>> default.rgw.buckets.non-ec  10   32  1.4 KiB        0  4.3 KiB      0    9.2 TiB
>> default.rgw.buckets.data    11   32  232 GiB   61.02k  697 GiB   2.41    9.2 TiB
>> EC-22-Pool                  12  128  9.8 TiB    2.39M   20 TiB  41.55     14 TiB
>>
>> > Maybe provide the output of the first two commands for
>> > osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each
>> > (wait a bit after setting these and then collect the output).
>> > Include the applied values for osd_max_backfills* and
>> > osd_recovery_max_active* for one of the OSDs in the pool (ceph
>> > config show osd.ID | grep -e osd_max_backfills -e
>> > osd_recovery_max_active).
>>
>> I didn't notice any speed difference with sleep values changed, but
>> I'll grab the stats between changes when I have a chance.
>>
>> ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
>> osd_max_backfills            1000  override  mon[5]
>> osd_recovery_max_active      1000  override
>> osd_recovery_max_active_hdd  1000  override  mon[5]
>> osd_recovery_max_active_ssd  1000  override
>>
>> > I don't really know if on such a small cluster one can expect more
>> > than what you see.
>> > It has nothing to do with network speed if you have a 10G line.
>> > However, recovery is something completely different from a full
>> > link-speed copy.
>> >
>> > I can tell you that boatloads of tiny objects are a huge pain for
>> > recovery, even on SSD. Ceph doesn't RAID sections of disks against
>> > each other, but object for object. This might be a feature request:
>> > that PG space allocation and recovery should follow the model of
>> > LVM extents (ideally match with LVM extents) to allow
>> > recovering/rebalancing larger chunks of storage in one go,
>> > containing parts of a large object or of many small objects.
>> >
>> > Best regards,
>> > =================
>> > Frank Schilder
>> > AIT Risø Campus
>> > Bygning 109, rum S14
>> >
>> > ________________________________________
>> > From: Curt <lightspd@xxxxxxxxx>
>> > Sent: 27 June 2022 17:35:19
>> > To: Frank Schilder
>> > Cc: ceph-users@xxxxxxx
>> > Subject: Re: Re: Ceph recovery network speed
>> >
>> > Hello,
>> >
>> > I had already increased/changed those variables previously. I
>> > increased pg_num to 128, which increased the number of PGs
>> > backfilling, but speed is still only 30 MiB/s on average and it has
>> > been backfilling 23 PGs for the last several hours. Should I
>> > increase it higher than 128?
>> >
>> > I'm still trying to figure out if this is just how Ceph is or if
>> > there is a bottleneck somewhere. For example, if I sftp a 10 GB
>> > file between servers it's done in a couple of minutes or less. Am I
>> > thinking of this wrong?
>> >
>> > Thanks,
>> > Curt
>> >
>> > On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder <frans@xxxxxx> wrote:
>> >
>> > Hi Curt,
>> >
>> > as far as I understood, a 2+2 EC pool is recovering, which makes 1
>> > OSD per host busy. My experience is that the algorithm for
>> > selecting PGs to backfill/recover is not very smart. It could
>> > simply be that it doesn't find more PGs without violating some of
>> > these settings:
>> >
>> > osd_max_backfills
>> > osd_recovery_max_active
>> >
>> > I have never observed the second parameter to change anything (try
>> > anyway). However, the first one has a large impact. You could try
>> > increasing it slowly until recovery moves faster. Another parameter
>> > you might want to try is
>> >
>> > osd_recovery_sleep_[hdd|ssd]
>> >
>> > Be careful, as this will impact client IO. I could reduce the sleep
>> > for my HDDs to 0.05. With your workload pattern, this might be
>> > something you can tune as well.
>> >
>> > Having said that, I think you should increase your PG count on the
>> > EC pool as soon as the cluster is healthy. You have only about 20
>> > PGs per OSD, and large PGs will take unnecessarily long to recover.
>> > A higher PG count will also make it easier for the scheduler to
>> > find PGs for recovery/backfill. Aim for a number between 100 and
>> > 200. Give the pool(s) with the most data (#objects) the most PGs.
>> >
>> > Best regards,
>> > =================
>> > Frank Schilder
>> > AIT Risø Campus
>> > Bygning 109, rum S14
>> >
>> > ________________________________________
>> > From: Curt <lightspd@xxxxxxxxx>
>> > Sent: 24 June 2022 19:04
>> > To: Anthony D'Atri; ceph-users@xxxxxxx
>> > Subject: Re: Ceph recovery network speed
>> >
>> > 2 PGs shouldn't take hours to backfill, in my opinion. Just 2 TB
>> > enterprise HDDs.
>> >
>> > Take this log entry below: 72 minutes and still backfilling
>> > undersized? Should it be that slow?
>> >
>> > pg 12.15 is stuck undersized for 72m, current state
>> > active+undersized+degraded+remapped+backfilling, last acting
>> > [34,10,29,NONE]
>> >
>> > Thanks,
>> > Curt
>> >
>> > On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri
>> > <anthony.datri@xxxxxxxxx> wrote:
>> >
>> > > Your recovery is slow *because* there are only 2 PGs backfilling.
>> > >
>> > > What kind of OSD media are you using?
>> > >
>> > > > On Jun 24, 2022, at 09:46, Curt <lightspd@xxxxxxxxx> wrote:
>> > > >
>> > > > Hello,
>> > > >
>> > > > I'm trying to understand why my recovery is so slow with only 2
>> > > > PGs backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G
>> > > > network. I have tested the speed between machines with a few
>> > > > tools and all confirm 10G speed. I've tried changing various
>> > > > settings of priority and recovery sleep hdd, but still the
>> > > > same. Is this a configuration issue or something else?
>> > > >
>> > > > It's just a small cluster right now with 4 hosts, 11 OSDs each.
>> > > > Please let me know if you need more information.
>> > > >
>> > > > Thanks,
>> > > > Curt
>> >
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
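As a rough sanity check on the figures quoted in this thread, the
observed object rate already implies a multi-hour backfill. A minimal
back-of-the-envelope sketch (my own illustration, not from the thread;
it assumes the recovery rate stays constant, which it rarely does):

```python
def recovery_eta_hours(misplaced_objects: int, objects_per_sec: float) -> float:
    """Naive recovery ETA: misplaced objects / observed object recovery rate."""
    return misplaced_objects / objects_per_sec / 3600.0

# Figures from the "ceph status" output quoted above:
# 512136 misplaced objects recovering at 8 objects/s.
print(f"{recovery_eta_hours(512136, 8):.1f} hours")  # ~17.8 hours
```

This is also why, for pools full of small objects, objects/s rather
than MiB/s is the number to watch, per Frank's point about per-object
recovery.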