On Wed, Jun 29, 2022 at 1:06 PM Frank Schilder <frans@xxxxxx> wrote:
> Hi,
>
> did you wait for PG creation and peering to finish after setting pg_num and pgp_num? They should be right on the value you set and not lower.

Yes, the only thing going on was backfill. It's still just slowly expanding pg and pgp nums. I even ran the set command again. Here's the current info:

ceph osd pool get EC-22-Pool all
size: 4
min_size: 3
pg_num: 226
pgp_num: 98
crush_rule: EC-22-Pool
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: EC-22-Pro
fast_read: 0
pg_autoscale_mode: off
eio: false
bulk: false

> > How do you set the upmap balancer per pool?
>
> I'm afraid the answer is RTFM. I don't use it, but I believe I remember one could configure it for equi-distribution of PGs for each pool.

Ok, I'll dig around some more. I glanced at the balancer page and didn't see it.

> Whenever you grow the cluster, you should make the same considerations again and select the number of PGs per pool depending on number of objects, capacity and performance.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 28 June 2022 16:33:24
> To: Frank Schilder
> Cc: Robert Gallop; ceph-users@xxxxxxx
> Subject: Re: Re: Ceph recovery network speed
>
> Hi Frank,
>
> Thank you for the thorough breakdown. I have increased the pg_num and pgp_num to 1024 to start on the ec-22 pool. That is going to be my primary pool with the most data. It looks like ceph slowly scales the pg up even with autoscaling off, since I see target_pg_num 2048, pg_num 199.
>
> root@cephmgr:/# ceph osd pool set EC-22-Pool pg_num 2048
> set pool 12 pg_num to 2048
> root@cephmgr:/# ceph osd pool set EC-22-Pool pgp_num 2048
> set pool 12 pgp_num to 2048
> root@cephmgr:/# ceph osd pool get EC-22-Pool all
> size: 4
> min_size: 3
> pg_num: 199
> pgp_num: 71
> crush_rule: EC-22-Pool
> hashpspool: true
> allow_ec_overwrites: true
> nodelete: false
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> use_gmt_hitset: 1
> erasure_code_profile: EC-22-Pro
> fast_read: 0
> pg_autoscale_mode: off
> eio: false
> bulk: false
>
> This cluster will be growing quite a bit over the next few months. I am migrating data from their old Giant cluster to a new one; by the time I'm done it should be 16 hosts with about 400TB of data. I'm guessing I'll have to increase the PG count again later when I start adding more servers to the cluster.
>
> I will look into whether SSDs are an option. How do you set the upmap balancer per pool? Looking at ceph balancer status, my mode is already upmap.
>
> Thanks again,
> Curt
>
> On Tue, Jun 28, 2022 at 1:23 AM Frank Schilder <frans@xxxxxx> wrote:
> Hi Curt,
>
> looking at what you sent here, I believe you are the victim of "the law of large numbers really only holds for large numbers". In other words, the statistics of small samples is biting you. The PG numbers of your pools are so low that they lead to a very large imbalance of data- and IO placement. In other words, in your cluster a few OSDs receive the majority of IO requests and bottleneck the entire cluster.
>
> If I see this correctly, the PG num per drive varies from 14 to 40. That's an insane imbalance.
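For the per-pool balancing question, one approach that should work on recent releases is to drive the upmap balancer by hand and give it a single pool. This is only a rough sketch (not tested on this cluster; "myplan" is just a placeholder plan name):

  ceph balancer status
  ceph balancer mode upmap
  ceph balancer off                         # pause background balancing while working with a manual plan
  ceph balancer eval EC-22-Pool             # score the current PG distribution of just this pool
  ceph balancer optimize myplan EC-22-Pool  # build a plan that only touches EC-22-Pool
  ceph balancer show myplan                 # review the proposed upmap items before applying
  ceph balancer execute myplan
  ceph balancer rm myplan

The automatic mode balances all pools together; restricting a manual plan to one pool is what gives the per-pool spread mentioned here.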
> Also, on your EC pool PG_num is 128 but PGP_num is only 48. The autoscaler is screwing it up for you. It will slowly increase the number of active PGs, causing continuous relocation of objects for a very long time.
>
> I think the recovery speed you see of 8 objects per second is not too bad considering that you have an HDD-only cluster. The speed does not increase because it is a small number of PGs sending data - a subset of the 32 you had before. In addition, due to the imbalance of PGs per OSD, only a small number of PGs will be able to send data. You will need patience to get out of this corner.
>
> The first thing I would do is look at which pools are important for your workload in the long run. I see 2 pools having a significant number of objects: EC-22-Pool and default.rgw.buckets.data. EC-22-Pool has about 40 times the number of objects and bytes as default.rgw.buckets.data. I would scale both up in PG count with emphasis on EC-22-Pool.
>
> Your cluster can safely operate between 1100 and 2200 PGs with replication <=4. If you don't plan to create more large pools, a good choice of distributing this capacity might be
>
> EC-22-Pool: 1024 PGs (could be pushed up to 2048)
> default.rgw.buckets.data: 256 PGs
>
> That's towards the lower end of available PGs. Please make your own calculation and judgement.
>
> If you have settled on target numbers, change the pool sizes in one go, that is, set PG_num and PGP_num to the same value right away. You might need to turn the autoscaler off for these 2 pools. The rebalancing will take a long time and also not speed up, because the few sending PGs are the bottleneck, not the receiving ones. You will have to sit it out.
>
> The goal is that, in the future, recovery and re-balancing are improved. In my experience, a reasonably high PG count will also reduce latency of client IO.
>
> The next thing to look at is the distribution of PGs per OSD. This has an enormous performance impact, because a few too busy OSDs can throttle an entire cluster (it's always the slowest disk that wins). I use the very simple reweight-by-utilization method, but my pools do not share OSDs as yours do. You might want to try the upmap balancer per pool to get PGs per pool evenly spread out over OSDs.
>
> Lastly, if you can afford it and your hosts have a slot left, consider buying one enterprise SSD per host for the meta-data pools to get this IO away from the HDDs. If you buy a bunch of 128G or 256G SATA SSDs, you can probably place everything except the EC-22-Pool on these drives, separating completely.
>
> Hope that helps and maybe someone else has ideas as well?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 27 June 2022 21:36:27
> To: Frank Schilder
> Cc: Robert Gallop; ceph-users@xxxxxxx
> Subject: Re: Re: Ceph recovery network speed
>
> On Mon, Jun 27, 2022 at 11:08 PM Frank Schilder <frans@xxxxxx> wrote:
> Do you, by any chance, have SMR drives? This may not be stated on the drive; check what the internet has to say. I also would have liked to see the beginning of the ceph status, number of hosts, number of OSDs, up and down, whatever. Can you also send the result of ceph osd df tree?
>
> As far as I can tell, none of the drives are SMR drives.
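For the SMR question, host-aware and host-managed SMR drives can be spotted from the kernel without trusting the label; drive-managed SMR unfortunately presents itself as a normal disk, so the vendor datasheet is still the final word. A quick check on each OSD host might look like this (a sketch, assuming the usual sysfs layout on a reasonably recent kernel):

  # prints "none" for conventional (or drive-managed SMR) disks,
  # "host-aware" or "host-managed" for SMR disks the kernel can see
  for d in /sys/block/sd*; do echo "$(basename $d): $(cat $d/queue/zoned)"; done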
> I did have some inconsistent pop up, scrubs are still running. > > cluster: > id: 1684fe88-aae0-11ec-9593-df430e3982a0 > health: HEALTH_ERR > 10 scrub errors > Possible data damage: 4 pgs inconsistent > > services: > mon: 5 daemons, quorum cephmgr,cephmon1,cephmon2,cephmon3,cephmgr2 > (age 8w) > mgr: cephmon1.fxtvtu(active, since 2d), standbys: cephmon2.wrzwwn, > cephmgr2.hzsrdo, cephmgr.bazebq > osd: 44 osds: 44 up (since 3d), 44 in (since 3d); 28 remapped pgs > rgw: 2 daemons active (2 hosts, 1 zones) > > data: > pools: 11 pools, 369 pgs > objects: 2.45M objects, 9.2 TiB > usage: 21 TiB used, 59 TiB / 80 TiB avail > pgs: 503944/9729081 objects misplaced (5.180%) > 337 active+clean > 28 active+remapped+backfilling > 4 active+clean+inconsistent > > io: > client: 1000 KiB/s rd, 717 KiB/s wr, 81 op/s rd, 57 op/s wr > recovery: 34 MiB/s, 8 objects/s > > ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META > AVAIL %USE VAR PGS STATUS TYPE NAME > -1 80.05347 - 80 TiB 21 TiB 21 TiB 32 MiB 69 > GiB 59 TiB 26.23 1.00 - root default > -5 20.01337 - 20 TiB 5.3 TiB 5.3 TiB 1.4 MiB 19 > GiB 15 TiB 26.47 1.01 - host hyperion01 > 1 hdd 1.81940 1.00000 1.8 TiB 749 GiB 747 GiB 224 KiB 2.2 > GiB 1.1 TiB 40.19 1.53 36 up osd.1 > 3 hdd 1.81940 1.00000 1.8 TiB 531 GiB 530 GiB 3 KiB 1.9 > GiB 1.3 TiB 28.52 1.09 31 up osd.3 > 5 hdd 1.81940 1.00000 1.8 TiB 167 GiB 166 GiB 36 KiB 1.2 > GiB 1.7 TiB 8.98 0.34 18 up osd.5 > 7 hdd 1.81940 1.00000 1.8 TiB 318 GiB 316 GiB 83 KiB 1.2 > GiB 1.5 TiB 17.04 0.65 26 up osd.7 > 9 hdd 1.81940 1.00000 1.8 TiB 1017 GiB 1014 GiB 139 KiB 2.6 > GiB 846 GiB 54.59 2.08 38 up osd.9 > 11 hdd 1.81940 1.00000 1.8 TiB 569 GiB 567 GiB 4 KiB 2.1 > GiB 1.3 TiB 30.56 1.17 29 up osd.11 > 13 hdd 1.81940 1.00000 1.8 TiB 293 GiB 291 GiB 338 KiB 1.5 > GiB 1.5 TiB 15.72 0.60 23 up osd.13 > 15 hdd 1.81940 1.00000 1.8 TiB 368 GiB 366 GiB 641 KiB 1.6 > GiB 1.5 TiB 19.74 0.75 23 up osd.15 > 17 hdd 1.81940 1.00000 1.8 TiB 369 GiB 367 GiB 2 KiB 1.5 > GiB 1.5 TiB 19.80 0.75 26 up osd.17 > 19 hdd 1.81940 1.00000 1.8 TiB 404 GiB 403 GiB 7 KiB 1.1 > GiB 1.4 TiB 21.69 0.83 31 up osd.19 > 45 hdd 1.81940 1.00000 1.8 TiB 639 GiB 637 GiB 2 KiB 2.0 > GiB 1.2 TiB 34.30 1.31 32 up osd.45 > -3 20.01337 - 20 TiB 5.2 TiB 5.2 TiB 2.0 MiB 18 > GiB 15 TiB 26.15 1.00 - host hyperion02 > 0 hdd 1.81940 1.00000 1.8 TiB 606 GiB 604 GiB 302 KiB 2.0 > GiB 1.2 TiB 32.52 1.24 33 up osd.0 > 2 hdd 1.81940 1.00000 1.8 TiB 58 GiB 58 GiB 112 KiB 249 > MiB 1.8 TiB 3.14 0.12 14 up osd.2 > 4 hdd 1.81940 1.00000 1.8 TiB 254 GiB 252 GiB 14 KiB 1.6 > GiB 1.6 TiB 13.63 0.52 28 up osd.4 > 6 hdd 1.81940 1.00000 1.8 TiB 574 GiB 572 GiB 1 KiB 1.8 > GiB 1.3 TiB 30.81 1.17 26 up osd.6 > 8 hdd 1.81940 1.00000 1.8 TiB 201 GiB 200 GiB 618 KiB 743 > MiB 1.6 TiB 10.77 0.41 23 up osd.8 > 10 hdd 1.81940 1.00000 1.8 TiB 628 GiB 626 GiB 4 KiB 2.2 > GiB 1.2 TiB 33.72 1.29 37 up osd.10 > 12 hdd 1.81940 1.00000 1.8 TiB 355 GiB 353 GiB 361 KiB 1.2 > GiB 1.5 TiB 19.03 0.73 30 up osd.12 > 14 hdd 1.81940 1.00000 1.8 TiB 1.1 TiB 1.1 TiB 1 KiB 2.7 > GiB 708 GiB 62.00 2.36 38 up osd.14 > 16 hdd 1.81940 1.00000 1.8 TiB 240 GiB 239 GiB 4 KiB 1.2 > GiB 1.6 TiB 12.90 0.49 20 up osd.16 > 18 hdd 1.81940 1.00000 1.8 TiB 300 GiB 298 GiB 542 KiB 1.6 > GiB 1.5 TiB 16.08 0.61 21 up osd.18 > 32 hdd 1.81940 1.00000 1.8 TiB 989 GiB 986 GiB 45 KiB 2.7 > GiB 874 GiB 53.09 2.02 36 up osd.32 > -7 20.01337 - 20 TiB 5.2 TiB 5.2 TiB 2.9 MiB 17 > GiB 15 TiB 26.06 0.99 - host hyperion03 > 22 hdd 1.81940 1.00000 1.8 TiB 449 GiB 448 GiB 443 KiB 1.5 > GiB 1.4 TiB 24.10 0.92 31 up osd.22 > 
23 hdd 1.81940 1.00000 1.8 TiB 299 GiB 298 GiB 5 KiB 1.4 > GiB 1.5 TiB 16.05 0.61 26 up osd.23 > 24 hdd 1.81940 1.00000 1.8 TiB 735 GiB 733 GiB 8 KiB 2.3 > GiB 1.1 TiB 39.45 1.50 33 up osd.24 > 25 hdd 1.81940 1.00000 1.8 TiB 519 GiB 517 GiB 5 KiB 1.4 > GiB 1.3 TiB 27.85 1.06 26 up osd.25 > 26 hdd 1.81940 1.00000 1.8 TiB 483 GiB 481 GiB 614 KiB 1.7 > GiB 1.3 TiB 25.94 0.99 28 up osd.26 > 27 hdd 1.81940 1.00000 1.8 TiB 226 GiB 225 GiB 1.5 MiB 1.0 > GiB 1.6 TiB 12.11 0.46 17 up osd.27 > 28 hdd 1.81940 1.00000 1.8 TiB 443 GiB 441 GiB 24 KiB 1.5 > GiB 1.4 TiB 23.76 0.91 21 up osd.28 > 29 hdd 1.81940 1.00000 1.8 TiB 801 GiB 799 GiB 7 KiB 2.2 > GiB 1.0 TiB 42.98 1.64 31 up osd.29 > 30 hdd 1.81940 1.00000 1.8 TiB 523 GiB 522 GiB 174 KiB 1.2 > GiB 1.3 TiB 28.09 1.07 29 up osd.30 > 31 hdd 1.81940 1.00000 1.8 TiB 322 GiB 321 GiB 4 KiB 1.2 > GiB 1.5 TiB 17.30 0.66 26 up osd.31 > 44 hdd 1.81940 1.00000 1.8 TiB 541 GiB 540 GiB 136 KiB 1.4 > GiB 1.3 TiB 29.06 1.11 24 up osd.44 > -9 20.01337 - 20 TiB 5.3 TiB 5.2 TiB 25 MiB 16 > GiB 15 TiB 26.25 1.00 - host hyperion04 > 33 hdd 1.81940 1.00000 1.8 TiB 466 GiB 465 GiB 469 KiB 1.4 > GiB 1.4 TiB 25.02 0.95 28 up osd.33 > 34 hdd 1.81940 1.00000 1.8 TiB 508 GiB 506 GiB 2 KiB 1.8 > GiB 1.3 TiB 27.28 1.04 30 up osd.34 > 35 hdd 1.81940 1.00000 1.8 TiB 521 GiB 520 GiB 2 KiB 1.4 > GiB 1.3 TiB 27.98 1.07 32 up osd.35 > 36 hdd 1.81940 1.00000 1.8 TiB 872 GiB 870 GiB 3 KiB 2.3 > GiB 991 GiB 46.81 1.78 40 up osd.36 > 37 hdd 1.81940 1.00000 1.8 TiB 443 GiB 441 GiB 136 KiB 1.2 > GiB 1.4 TiB 23.75 0.91 25 up osd.37 > 38 hdd 1.81940 1.00000 1.8 TiB 138 GiB 137 GiB 24 MiB 647 > MiB 1.7 TiB 7.40 0.28 27 up osd.38 > 39 hdd 1.81940 1.00000 1.8 TiB 638 GiB 637 GiB 622 KiB 1.7 > GiB 1.2 TiB 34.26 1.31 33 up osd.39 > 40 hdd 1.81940 1.00000 1.8 TiB 444 GiB 443 GiB 14 KiB 1.4 > GiB 1.4 TiB 23.85 0.91 25 up osd.40 > 41 hdd 1.81940 1.00000 1.8 TiB 477 GiB 476 GiB 264 KiB 1.3 > GiB 1.4 TiB 25.60 0.98 31 up osd.41 > 42 hdd 1.81940 1.00000 1.8 TiB 514 GiB 513 GiB 35 KiB 1.2 > GiB 1.3 TiB 27.61 1.05 29 up osd.42 > 43 hdd 1.81940 1.00000 1.8 TiB 358 GiB 356 GiB 111 KiB 1.2 > GiB 1.5 TiB 19.19 0.73 24 up osd.43 > TOTAL 80 TiB 21 TiB 21 TiB 32 MiB 69 > GiB 59 TiB 26.23 > MIN/MAX VAR: 0.12/2.36 STDDEV: 12.47 > > The number of objects in flight looks small. Your objects seem to have an > average size of 4MB and should recover with full bandwidth. Check with top > how much IO wait percentage you have on the OSD hosts. > iowait is 3.3% and load avg is 3.7, nothing crazy from what I can tell. > > > The one thing that jumps to my eye though is, that you only have 22 dirty > PGs and they are all recovering/backfilling already. I wonder if you have a > problem with your crush rules, they might not do what you think they do. > You said you increased the PG count for EC-22-Pool to 128 (from what?) but > it doesn't really look like a suitable number of PGs has been marked for > backfilling. Can you post the output of "ceph osd pool get EC-22-Pool all"? 
> From 32 to 128
>
> ceph osd pool get EC-22-Pool all
> size: 4
> min_size: 3
> pg_num: 128
> pgp_num: 48
> crush_rule: EC-22-Pool
> hashpspool: true
> allow_ec_overwrites: true
> nodelete: false
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> use_gmt_hitset: 1
> erasure_code_profile: EC-22-Pro
> fast_read: 0
> pg_autoscale_mode: on
> eio: false
> bulk: false
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 27 June 2022 19:41:06
> To: Robert Gallop
> Cc: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re: Re: Ceph recovery network speed
>
> I would love to see those types of speeds. I tried setting it all the way to 0 and nothing; I did that before I sent the first email - maybe it was your old post I got it from.
>
> osd_recovery_sleep_hdd 0.000000 override (mon[0.000000])
>
> On Mon, Jun 27, 2022 at 9:27 PM Robert Gallop <robert.gallop@xxxxxxxxx> wrote:
> I saw a major boost after having the sleep_hdd set to 0. Only after that did I start staying at around 500MiB to 1.2GiB/sec and 1.5k obj/sec to 2.5k obj/sec.
>
> Eventually it tapered back down, but for me sleep was the key, and specifically in my case:
>
> osd_recovery_sleep_hdd
>
> On Mon, Jun 27, 2022 at 11:17 AM Curt <lightspd@xxxxxxxxx> wrote:
> On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder <frans@xxxxxx> wrote:
> > I think this is just how ceph is. Maybe you should post the output of "ceph status", "ceph osd pool stats" and "ceph df" so that we can get an idea whether what you look at is expected or not. As I wrote before, object recovery is throttled and the recovery bandwidth depends heavily on object size. The interesting question is how many objects per second are recovered/rebalanced.
>
> data:
>   pools:   11 pools, 369 pgs
>   objects: 2.45M objects, 9.2 TiB
>   usage:   20 TiB used, 60 TiB / 80 TiB avail
>   pgs:     512136/9729081 objects misplaced (5.264%)
>            343 active+clean
>            22 active+remapped+backfilling
>
> io:
>   client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
>   recovery: 34 MiB/s, 8 objects/s
>
> Pool 12 is the only one with any stats.
> pool EC-22-Pool id 12
>   510048/9545052 objects misplaced (5.344%)
>   recovery io 36 MiB/s, 9 objects/s
>   client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr
>
> --- RAW STORAGE ---
> CLASS  SIZE    AVAIL   USED    RAW USED  %RAW USED
> hdd    80 TiB  60 TiB  20 TiB  20 TiB    25.45
> TOTAL  80 TiB  60 TiB  20 TiB  20 TiB    25.45
>
> --- POOLS ---
> POOL                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> .mgr                         1    1  152 MiB       38  457 MiB      0    9.2 TiB
> 21BadPool                    3   32    8 KiB        1   12 KiB      0     18 TiB
> .rgw.root                    4   32  1.3 KiB        4   48 KiB      0    9.2 TiB
> default.rgw.log              5   32  3.6 KiB      209  408 KiB      0    9.2 TiB
> default.rgw.control          6   32      0 B        8      0 B      0    9.2 TiB
> default.rgw.meta             7    8  6.7 KiB       20  203 KiB      0    9.2 TiB
> rbd_rep_pool                 8   32  2.0 MiB        5  5.9 MiB      0    9.2 TiB
> default.rgw.buckets.index    9    8  2.0 MiB       33  5.9 MiB      0    9.2 TiB
> default.rgw.buckets.non-ec  10   32  1.4 KiB        0  4.3 KiB      0    9.2 TiB
> default.rgw.buckets.data    11   32  232 GiB   61.02k  697 GiB   2.41    9.2 TiB
> EC-22-Pool                  12  128  9.8 TiB    2.39M   20 TiB  41.55     14 TiB
>
> > Maybe provide the output of the first two commands for osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each (wait a bit after setting these and then collect the output). Include the applied values for osd_max_backfills* and osd_recovery_max_active* for one of the OSDs in the pool (ceph config show osd.ID | grep -e osd_max_backfills -e osd_recovery_max_active).
>
> I didn't notice any speed difference with the sleep values changed, but I'll grab the stats between changes when I have a chance.
>
> ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
> osd_max_backfills            1000  override  mon[5]
> osd_recovery_max_active      1000  override
> osd_recovery_max_active_hdd  1000  override  mon[5]
> osd_recovery_max_active_ssd  1000  override
>
> > I don't really know if on such a small cluster one can expect more than what you see. It has nothing to do with network speed if you have a 10G line. However, recovery is something completely different from a full link-speed copy.
> >
> > I can tell you that boatloads of tiny objects are a huge pain for recovery, even on SSD. Ceph doesn't raid up sections of disks against each other, but object for object. This might be a feature request: that PG space allocation and recovery should follow the model of LVM extents (ideally match with LVM extents) to allow recovery/rebalancing larger chunks of storage in one go, containing parts of a large or many small objects.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Curt <lightspd@xxxxxxxxx>
> > Sent: 27 June 2022 17:35:19
> > To: Frank Schilder
> > Cc: ceph-users@xxxxxxx
> > Subject: Re: Re: Ceph recovery network speed
> >
> > Hello,
> >
> > I had already increased/changed those variables previously. I increased the pg_num to 128, which increased the number of PGs backfilling, but speed is still only at 30 MiB/s avg and it has been backfilling 23 PGs for the last several hours. Should I increase it higher than 128?
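For the comparison asked for above (the same stats at two different sleep values), something along these lines should collect it; just a sketch, re-using osd.19 only because it is the OSD already shown:

  ceph config set osd osd_recovery_sleep_hdd 0.05
  # give recovery a few minutes to settle, then take the first sample
  ceph status
  ceph osd pool stats EC-22-Pool
  ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_sleep'
  ceph config set osd osd_recovery_sleep_hdd 0.1
  # wait again and repeat the same three commands for the second sample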
> > I'm still trying to figure out if this is just how ceph is or if there is a bottleneck somewhere. Like, if I sftp a 10G file between servers it's done in a couple of minutes or less. Am I thinking of this wrong?
> >
> > Thanks,
> > Curt
> >
> > On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder <frans@xxxxxx> wrote:
> > Hi Curt,
> >
> > as far as I understood, a 2+2 EC pool is recovering, which makes 1 OSD per host busy. My experience is that the algorithm for selecting PGs to backfill/recover is not very smart. It could simply be that it doesn't find more PGs without violating some of these settings:
> >
> > osd_max_backfills
> > osd_recovery_max_active
> >
> > I have never observed the second parameter to change anything (try anyway). However, the first one has a large impact. You could try increasing this slowly until recovery moves faster. Another parameter you might want to try is
> >
> > osd_recovery_sleep_[hdd|ssd]
> >
> > Be careful as this will impact client IO. I could reduce the sleep for my HDDs to 0.05. With your workload pattern, this might be something you can tune as well.
> >
> > Having said that, I think you should increase your PG count on the EC pool as soon as the cluster is healthy. You have only about 20 PGs per OSD and large PGs will take unnecessarily long to recover. A higher PG count will also make it easier for the scheduler to find PGs for recovery/backfill. Aim for a number between 100 and 200. Give the pool(s) with the most data (#objects) the most PGs.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Curt <lightspd@xxxxxxxxx>
> > Sent: 24 June 2022 19:04
> > To: Anthony D'Atri; ceph-users@xxxxxxx
> > Subject: Re: Ceph recovery network speed
> >
> > 2 PGs shouldn't take hours to backfill, in my opinion. Just 2TB enterprise HDDs.
> >
> > Take this log entry below: 72 minutes and still backfilling undersized? Should it be that slow?
> > pg 12.15 is stuck undersized for 72m, current state active+undersized+degraded+remapped+backfilling, last acting [34,10,29,NONE]
> >
> > Thanks,
> > Curt
> >
> > On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> >
> > > Your recovery is slow *because* there are only 2 PGs backfilling.
> > >
> > > What kind of OSD media are you using?
> > >
> > > > On Jun 24, 2022, at 09:46, Curt <lightspd@xxxxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I'm trying to understand why my recovery is so slow with only 2 PGs backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network. I have tested the speed between machines with a few tools and all confirm 10G speed. I've tried changing various settings of priority and recovery sleep hdd, but still the same. Is this a configuration issue or something else?
> > > >
> > > > It's just a small cluster right now with 4 hosts, 11 OSDs per host. Please let me know if you need more information.
> > > > Thanks,
> > > > Curt
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx