Re: Ceph recovery network speed

On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder <frans@xxxxxx> wrote:

> I think this is just how ceph is. Maybe you should post the output of
> "ceph status", "ceph osd pool stats" and "ceph df" so that we can get an
> idea whether what you are seeing is expected or not. As I wrote before, object
> recovery is throttled and the recovery bandwidth depends heavily on object
> size. The interesting question is how many objects per second are
> recovered/rebalanced.
>
 data:
    pools:   11 pools, 369 pgs
    objects: 2.45M objects, 9.2 TiB
    usage:   20 TiB used, 60 TiB / 80 TiB avail
    pgs:     512136/9729081 objects misplaced (5.264%)
             343 active+clean
             22  active+remapped+backfilling

  io:
    client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
    recovery: 34 MiB/s, 8 objects/s

Pool 12 (EC-22-Pool) is the only one showing any recovery stats:

pool EC-22-Pool id 12
  510048/9545052 objects misplaced (5.344%)
  recovery io 36 MiB/s, 9 objects/s
  client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr
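
Doing some rough math on those numbers (assuming the objects are fairly
uniform in size): 9.2 TiB across 2.45M objects is about 3.9 MiB per object,
and 9 objects/s x ~4 MiB is roughly the 36 MiB/s shown above. So the
throughput looks limited by the per-object recovery rate rather than by the
network.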

--- RAW STORAGE ---
CLASS    SIZE   AVAIL    USED  RAW USED  %RAW USED
hdd    80 TiB  60 TiB  20 TiB    20 TiB      25.45
TOTAL  80 TiB  60 TiB  20 TiB    20 TiB      25.45

--- POOLS ---
POOL                        ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                         1    1  152 MiB       38  457 MiB      0    9.2 TiB
21BadPool                    3   32    8 KiB        1   12 KiB      0     18 TiB
.rgw.root                    4   32  1.3 KiB        4   48 KiB      0    9.2 TiB
default.rgw.log              5   32  3.6 KiB      209  408 KiB      0    9.2 TiB
default.rgw.control          6   32      0 B        8      0 B      0    9.2 TiB
default.rgw.meta             7    8  6.7 KiB       20  203 KiB      0    9.2 TiB
rbd_rep_pool                 8   32  2.0 MiB        5  5.9 MiB      0    9.2 TiB
default.rgw.buckets.index    9    8  2.0 MiB       33  5.9 MiB      0    9.2 TiB
default.rgw.buckets.non-ec  10   32  1.4 KiB        0  4.3 KiB      0    9.2 TiB
default.rgw.buckets.data    11   32  232 GiB   61.02k  697 GiB   2.41    9.2 TiB
EC-22-Pool                  12  128  9.8 TiB    2.39M   20 TiB  41.55     14 TiB
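
Another back-of-the-envelope number: EC-22-Pool stores 9.8 TiB in 128 PGs,
so each PG holds on the order of 75-80 GiB. At ~36 MiB/s aggregate spread
over 22 backfilling PGs, moving even a fraction of a PG takes hours, which
seems consistent with what I'm seeing.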



> Maybe provide the output of the first two commands for
> osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each (wait a bit
> after setting these and then collect the output). Include the applied
> values for osd_max_backfills* and osd_recovery_max_active* for one of the
> OSDs in the pool (ceph config show osd.ID | grep -e osd_max_backfills -e
> osd_recovery_max_active).
>

I didn't notice any speed difference with the sleep values changed, but I'll
grab the stats between changes when I have a chance.
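
For the record, this is roughly how I plan to collect them (assuming that
setting the option globally for the osd section is the right way to apply
it here):

ceph config set osd osd_recovery_sleep_hdd 0.05
# wait a few minutes, then capture
ceph status
ceph osd pool stats EC-22-Pool
ceph config set osd osd_recovery_sleep_hdd 0.1
# wait again, then capture
ceph status
ceph osd pool stats EC-22-Pool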

ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
osd_max_backfills             1000  override  mon[5]
osd_recovery_max_active       1000  override
osd_recovery_max_active_hdd   1000  override  mon[5]
osd_recovery_max_active_ssd   1000  override
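
One more thing I want to rule out (just a guess on my part): if this cluster
is running Quincy with the mClock scheduler active, then, if I remember the
docs correctly, the osd_recovery_sleep_* options are ignored, which would
explain why changing them made no visible difference. Checking which op
queue is in use should be as simple as:

ceph config show osd.19 | grep osd_op_queue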

>
> I don't really know if on such a small cluster one can expect more than
> what you see. It has nothing to do with network speed if you have a 10G
> line. However, recovery is something completely different from a full
> link-speed copy.
>
> I can tell you that boatloads of tiny objects are a huge pain for
> recovery, even on SSD. Ceph doesn't rebuild whole sections of disks the way
> RAID does, but object by object. This might be a feature request: that PG
> space allocation and recovery should follow the model of LVM extents
> (ideally match with LVM extents) to allow recovery/rebalancing larger
> chunks of storage in one go, containing parts of a large object or many
> small objects.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 27 June 2022 17:35:19
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re:  Re: Ceph recovery network speed
>
> Hello,
>
> I had already increased/changed those variables previously. I increased
> pg_num to 128, which increased the number of PGs backfilling, but the
> speed is still only about 30 MiB/s on average and it has been backfilling
> 23 PGs for the last several hours. Should I increase it higher than 128?
>
> I'm still trying to figure out if this is just how ceph is or if there is
> a bottleneck somewhere. For example, if I sftp a 10 GB file between servers
> it's done in a couple of minutes or less. Am I thinking of this wrong?
>
> Thanks,
> Curt
>
> On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder <frans@xxxxxx> wrote:
> Hi Curt,
>
> as far as I understood, a 2+2 EC pool is recovering, which makes 1 OSD per
> host busy. My experience is that the algorithm for selecting PGs to
> backfill/recover is not very smart. It could simply be that it doesn't find
> more PGs without violating some of these settings:
>
> osd_max_backfills
> osd_recovery_max_active
>
> I have never observed the second parameter to change anything (try it
> anyway). However, the first one has a large impact. You could try increasing
> this slowly until recovery moves faster. Another parameter you might want
> to try is
>
> osd_recovery_sleep_[hdd|ssd]
>
> Be careful as this will impact client IO. I could reduce the sleep for my
> HDDs to 0.05. With your workload pattern, this might be something you can
> tune as well.
>
> Having said that, I think you should increase your PG count on the EC pool
> as soon as the cluster is healthy. You have only about 20 PGs per OSD and
> large PGs will take unnecessarily long to recover. A higher PG count will
> also make it easier for the scheduler to find PGs for recovery/backfill.
> Aim for a number between 100 and 200. Give the pool(s) with most data
> (#objects) the most PGs.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 24 June 2022 19:04
> To: Anthony D'Atri; ceph-users@xxxxxxx
> Subject:  Re: Ceph recovery network speed
>
> 2 PGs shouldn't take hours to backfill, in my opinion. They're just 2 TB
> enterprise HDDs.
>
> Take this log entry below, 72 minutes and still backfilling undersized?
> Should it be that slow?
>
> pg 12.15 is stuck undersized for 72m, current state
> active+undersized+degraded+remapped+backfilling, last acting
> [34,10,29,NONE]
>
> Thanks,
> Curt
>
>
> On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>
> > Your recovery is slow *because* there are only 2 PGs backfilling.
> >
> > What kind of OSD media are you using?
> >
> > > On Jun 24, 2022, at 09:46, Curt <lightspd@xxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > I'm trying to understand why my recovery is so slow with only 2 PGs
> > > backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network. I
> > > have tested the speed between machines with a few tools and all confirm
> > > 10G speed. I've tried changing various settings of priority and recovery
> > > sleep hdd, but still the same. Is this a configuration issue or something
> > > else?
> > >
> > > It's just a small cluster right now with 4 hosts and 11 OSDs per host.
> > > Please let me know if you need more information.
> > >
> > > Thanks,
> > > Curt
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



