Hi,

did you wait for PG creation and peering to finish after setting pg_num and pgp_num? They should be right at the value you set and not lower.

> How do you set the upmap balancer per pool?

I'm afraid the answer is RTFM. I don't use it, but I seem to remember one could configure it for equi-distribution of PGs for each pool.

Whenever you grow the cluster, you should make the same considerations again and select the number of PGs per pool depending on the number of objects, capacity and performance.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Curt <lightspd@xxxxxxxxx>
Sent: 28 June 2022 16:33:24
To: Frank Schilder
Cc: Robert Gallop; ceph-users@xxxxxxx
Subject: Re: Re: Ceph recovery network speed

Hi Frank,

Thank you for the thorough breakdown. I have increased the pg_num and pgp_num to 1024 to start on the ec-22 pool. That is going to be my primary pool with the most data. It looks like ceph slowly scales the pg up even with autoscaling off, since I see target_pg_num 2048, pg_num 199.

root@cephmgr:/# ceph osd pool set EC-22-Pool pg_num 2048
set pool 12 pg_num to 2048
root@cephmgr:/# ceph osd pool set EC-22-Pool pgp_num 2048
set pool 12 pgp_num to 2048
root@cephmgr:/# ceph osd pool get EC-22-Pool all
size: 4
min_size: 3
pg_num: 199
pgp_num: 71
crush_rule: EC-22-Pool
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: EC-22-Pro
fast_read: 0
pg_autoscale_mode: off
eio: false
bulk: false

This cluster will be growing quite a bit over the next few months. I am migrating data from their old Giant cluster to a new one; by the time I'm done it should be 16 hosts with about 400TB of data. I'm guessing I'll have to increase pg again later when I start adding more servers to the cluster. I will look into whether SSDs are an option.

How do you set the upmap balancer per pool? Looking at ceph balancer status, my mode is already upmap.

Thanks again,
Curt
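A minimal sketch of what per-pool balancing looks like with the mgr balancer module, assuming a release that supports per-pool balancer targets (the pool names are the ones from this thread; check the balancer documentation for your version):

ceph balancer mode upmap
# restrict automatic balancing to the pools that matter
ceph balancer pool add EC-22-Pool
ceph balancer pool add default.rgw.buckets.data
ceph balancer pool ls      # list the pools the balancer will act on
ceph balancer on
ceph balancer status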
On Tue, Jun 28, 2022 at 1:23 AM Frank Schilder <frans@xxxxxx> wrote:

Hi Curt,

looking at what you sent here, I believe you are the victim of "the law of large numbers really only holds for large numbers". In other words, the statistics of small samples is biting you. The PG numbers of your pools are so low that they lead to a very large imbalance of data and IO placement. In other words, in your cluster a few OSDs receive the majority of IO requests and bottleneck the entire cluster.

If I see this correctly, the PG num per drive varies from 14 to 40. That's an insane imbalance. Also, on your EC pool pg_num is 128 but pgp_num is only 48. The autoscaler is screwing it up for you. It will slowly increase the number of active PGs, causing continuous relocation of objects for a very long time.

I think the recovery speed you see of 8 objects per second is not too bad considering that you have an HDD-only cluster. The speed does not increase because it is a small number of PGs sending data - a subset of the 32 you had before. In addition, due to the imbalance of PGs per OSD, only a small number of PGs will be able to send data. You will need patience to get out of this corner.

The first thing I would do is look at which pools are important for your workload in the long run. I see 2 pools having a significant number of objects: EC-22-Pool and default.rgw.buckets.data. EC-22-Pool has about 40 times the number of objects and bytes as default.rgw.buckets.data. I would scale both up in PG count with emphasis on EC-22-Pool. Your cluster can safely operate between 1100 and 2200 PGs with replication <=4. If you don't plan to create more large pools, a good choice of distributing this capacity might be:

EC-22-Pool: 1024 PGs (could be pushed up to 2048)
default.rgw.buckets.data: 256 PGs

That's towards the lower end of available PGs. Please make your own calculation and judgement. If you have settled on target numbers, change the pool sizes in one go, that is, set pg_num and pgp_num to the same value right away. You might need to turn the autoscaler off for these 2 pools. The rebalancing will take a long time and also not speed up, because the few sending PGs are the bottleneck, not the receiving ones. You will have to sit it out. The goal is that, in the future, recovery and re-balancing are improved. In my experience, a reasonably high PG count will also reduce the latency of client IO.

The next thing to look at is the distribution of PGs per OSD. This has an enormous performance impact, because a few too-busy OSDs can throttle an entire cluster (it's always the slowest disk that wins). I use the very simple reweight-by-utilization method, but my pools do not share OSDs as yours do. You might want to try the upmap balancer per pool to get PGs per pool evenly spread out over OSDs.

Lastly, if you can afford it and your hosts have a slot left, consider buying one enterprise SSD per host for the meta-data pools to get this IO away from the HDDs. If you buy a bunch of 128G or 256G SATA SSDs, you can probably place everything except the EC-22-Pool on these drives, separating completely.

Hope that helps, and maybe someone else has ideas as well?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
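To make the arithmetic behind the 1100-2200 figure explicit and to illustrate the "change it in one go" advice, here is a sketch using the pool names and suggested sizes from above (the values are Frank's suggestions, not a prescription; make your own calculation first):

# rule of thumb: total PG capacity ~ number of OSDs x 100..200 PGs per OSD / replica count
# here: 44 x 100 / 4 ~ 1100 and 44 x 200 / 4 ~ 2200
ceph osd pool set EC-22-Pool pg_autoscale_mode off
ceph osd pool set default.rgw.buckets.data pg_autoscale_mode off
ceph osd pool set EC-22-Pool pg_num 1024
ceph osd pool set EC-22-Pool pgp_num 1024
ceph osd pool set default.rgw.buckets.data pg_num 256
ceph osd pool set default.rgw.buckets.data pgp_num 256
# dry run of the reweight-by-utilization approach mentioned above
ceph osd test-reweight-by-utilization

Note that recent releases apply the pg/pgp split gradually in the background even when set like this, which matches what Curt observes further up in the thread (pg_num slowly stepping towards its target).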
________________________________________
From: Curt <lightspd@xxxxxxxxx>
Sent: 27 June 2022 21:36:27
To: Frank Schilder
Cc: Robert Gallop; ceph-users@xxxxxxx
Subject: Re: Re: Ceph recovery network speed

On Mon, Jun 27, 2022 at 11:08 PM Frank Schilder <frans@xxxxxx> wrote:

Do you, by any chance, have SMR drives? This may not be stated on the drive; check what the internet has to say. I also would have liked to see the beginning of the ceph status: number of hosts, number of OSDs, up and down, whatever. Can you also send the result of ceph osd df tree?

As far as I can tell none of the drives are SMR drives. I did have some inconsistencies pop up; scrubs are still running.

  cluster:
    id:     1684fe88-aae0-11ec-9593-df430e3982a0
    health: HEALTH_ERR
            10 scrub errors
            Possible data damage: 4 pgs inconsistent

  services:
    mon: 5 daemons, quorum cephmgr,cephmon1,cephmon2,cephmon3,cephmgr2 (age 8w)
    mgr: cephmon1.fxtvtu(active, since 2d), standbys: cephmon2.wrzwwn, cephmgr2.hzsrdo, cephmgr.bazebq
    osd: 44 osds: 44 up (since 3d), 44 in (since 3d); 28 remapped pgs
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   11 pools, 369 pgs
    objects: 2.45M objects, 9.2 TiB
    usage:   21 TiB used, 59 TiB / 80 TiB avail
    pgs:     503944/9729081 objects misplaced (5.180%)
             337 active+clean
             28  active+remapped+backfilling
             4   active+clean+inconsistent

  io:
    client:   1000 KiB/s rd, 717 KiB/s wr, 81 op/s rd, 57 op/s wr
    recovery: 34 MiB/s, 8 objects/s

ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 80.05347 - 80 TiB 21 TiB 21 TiB 32 MiB 69 GiB 59 TiB 26.23 1.00 - root default
-5 20.01337 - 20 TiB 5.3 TiB 5.3 TiB 1.4 MiB 19 GiB 15 TiB 26.47 1.01 - host hyperion01
1 hdd 1.81940 1.00000 1.8 TiB 749 GiB 747 GiB 224 KiB 2.2 GiB 1.1 TiB 40.19 1.53 36 up osd.1
3 hdd 1.81940 1.00000 1.8 TiB 531 GiB 530 GiB 3 KiB 1.9 GiB 1.3 TiB 28.52 1.09 31 up osd.3
5 hdd 1.81940 1.00000 1.8 TiB 167 GiB 166 GiB 36 KiB 1.2 GiB 1.7 TiB 8.98 0.34 18 up osd.5
7 hdd 1.81940 1.00000 1.8 TiB 318 GiB 316 GiB 83 KiB 1.2 GiB 1.5 TiB 17.04 0.65 26 up osd.7
9 hdd 1.81940 1.00000 1.8 TiB 1017 GiB 1014 GiB 139 KiB 2.6 GiB 846 GiB 54.59 2.08 38 up osd.9
11 hdd 1.81940 1.00000 1.8 TiB 569 GiB 567 GiB 4 KiB 2.1 GiB 1.3 TiB 30.56 1.17 29 up osd.11
13 hdd 1.81940 1.00000 1.8 TiB 293 GiB 291 GiB 338 KiB 1.5 GiB 1.5 TiB 15.72 0.60 23 up osd.13
15 hdd 1.81940 1.00000 1.8 TiB 368 GiB 366 GiB 641 KiB 1.6 GiB 1.5 TiB 19.74 0.75 23 up osd.15
17 hdd 1.81940 1.00000 1.8 TiB 369 GiB 367 GiB 2 KiB 1.5 GiB 1.5 TiB 19.80 0.75 26 up osd.17
19 hdd 1.81940 1.00000 1.8 TiB 404 GiB 403 GiB 7 KiB 1.1 GiB 1.4 TiB 21.69 0.83 31 up osd.19
45 hdd 1.81940 1.00000 1.8 TiB 639 GiB 637 GiB 2 KiB 2.0 GiB 1.2 TiB 34.30 1.31 32 up osd.45
-3 20.01337 - 20 TiB 5.2 TiB 5.2 TiB 2.0 MiB 18 GiB 15 TiB 26.15 1.00 - host hyperion02
0 hdd 1.81940 1.00000 1.8 TiB 606 GiB 604 GiB 302 KiB 2.0 GiB 1.2 TiB 32.52 1.24 33 up osd.0
2 hdd 1.81940 1.00000 1.8 TiB 58 GiB 58 GiB 112 KiB 249 MiB 1.8 TiB 3.14 0.12 14 up osd.2
4 hdd 1.81940 1.00000 1.8 TiB 254 GiB 252 GiB 14 KiB 1.6 GiB 1.6 TiB 13.63 0.52 28 up osd.4
6 hdd 1.81940 1.00000 1.8 TiB 574 GiB 572 GiB 1 KiB 1.8 GiB 1.3 TiB 30.81 1.17 26 up osd.6
8 hdd 1.81940 1.00000 1.8 TiB 201 GiB 200 GiB 618 KiB 743 MiB 1.6 TiB 10.77 0.41 23 up osd.8
10 hdd 1.81940 1.00000 1.8 TiB 628 GiB 626 GiB 4 KiB 2.2 GiB 1.2 TiB 33.72 1.29 37 up osd.10
12 hdd 1.81940 1.00000 1.8 TiB 355 GiB 353 GiB 361 KiB 1.2 GiB 1.5 TiB 19.03 0.73 30 up osd.12
14 hdd 1.81940 1.00000 1.8 TiB 1.1 TiB 1.1 TiB 1 KiB 2.7 GiB 708 GiB 62.00 2.36 38 up osd.14
16 hdd 1.81940 1.00000 1.8 TiB 240 GiB 239 GiB 4 KiB 1.2 GiB 1.6 TiB 12.90 0.49 20 up osd.16
18 hdd 1.81940 1.00000 1.8 TiB 300 GiB 298 GiB 542 KiB 1.6 GiB 1.5 TiB 16.08 0.61 21 up osd.18
32 hdd 1.81940 1.00000 1.8 TiB 989 GiB 986 GiB 45 KiB 2.7 GiB 874 GiB 53.09 2.02 36 up osd.32
-7 20.01337 - 20 TiB 5.2 TiB 5.2 TiB 2.9 MiB 17 GiB 15 TiB 26.06 0.99 - host hyperion03
22 hdd 1.81940 1.00000 1.8 TiB 449 GiB 448 GiB 443 KiB 1.5 GiB 1.4 TiB 24.10 0.92 31 up osd.22
23 hdd 1.81940 1.00000 1.8 TiB 299 GiB 298 GiB 5 KiB 1.4 GiB 1.5 TiB 16.05 0.61 26 up osd.23
24 hdd 1.81940 1.00000 1.8 TiB 735 GiB 733 GiB 8 KiB 2.3 GiB 1.1 TiB 39.45 1.50 33 up osd.24
25 hdd 1.81940 1.00000 1.8 TiB 519 GiB 517 GiB 5 KiB 1.4 GiB 1.3 TiB 27.85 1.06 26 up osd.25
26 hdd 1.81940 1.00000 1.8 TiB 483 GiB 481 GiB 614 KiB 1.7 GiB 1.3 TiB 25.94 0.99 28 up osd.26
27 hdd 1.81940 1.00000 1.8 TiB 226 GiB 225 GiB 1.5 MiB 1.0 GiB 1.6 TiB 12.11 0.46 17 up osd.27
28 hdd 1.81940 1.00000 1.8 TiB 443 GiB 441 GiB 24 KiB 1.5 GiB 1.4 TiB 23.76 0.91 21 up osd.28
29 hdd 1.81940 1.00000 1.8 TiB 801 GiB 799 GiB 7 KiB 2.2 GiB 1.0 TiB 42.98 1.64 31 up osd.29
30 hdd 1.81940 1.00000 1.8 TiB 523 GiB 522 GiB 174 KiB 1.2 GiB 1.3 TiB 28.09 1.07 29 up osd.30
31 hdd 1.81940 1.00000 1.8 TiB 322 GiB 321 GiB 4 KiB 1.2 GiB 1.5 TiB 17.30 0.66 26 up osd.31
44 hdd 1.81940 1.00000 1.8 TiB 541 GiB 540 GiB 136 KiB 1.4 GiB 1.3 TiB 29.06 1.11 24 up osd.44
-9 20.01337 - 20 TiB 5.3 TiB 5.2 TiB 25 MiB 16 GiB 15 TiB 26.25 1.00 - host hyperion04
33 hdd 1.81940 1.00000 1.8 TiB 466 GiB 465 GiB 469 KiB 1.4 GiB 1.4 TiB 25.02 0.95 28 up osd.33
34 hdd 1.81940 1.00000 1.8 TiB 508 GiB 506 GiB 2 KiB 1.8 GiB 1.3 TiB 27.28 1.04 30 up osd.34
35 hdd 1.81940 1.00000 1.8 TiB 521 GiB 520 GiB 2 KiB 1.4 GiB 1.3 TiB 27.98 1.07 32 up osd.35
36 hdd 1.81940 1.00000 1.8 TiB 872 GiB 870 GiB 3 KiB 2.3 GiB 991 GiB 46.81 1.78 40 up osd.36
37 hdd 1.81940 1.00000 1.8 TiB 443 GiB 441 GiB 136 KiB 1.2 GiB 1.4 TiB 23.75 0.91 25 up osd.37
38 hdd 1.81940 1.00000 1.8 TiB 138 GiB 137 GiB 24 MiB 647 MiB 1.7 TiB 7.40 0.28 27 up osd.38
39 hdd 1.81940 1.00000 1.8 TiB 638 GiB 637 GiB 622 KiB 1.7 GiB 1.2 TiB 34.26 1.31 33 up osd.39
40 hdd 1.81940 1.00000 1.8 TiB 444 GiB 443 GiB 14 KiB 1.4 GiB 1.4 TiB 23.85 0.91 25 up osd.40
41 hdd 1.81940 1.00000 1.8 TiB 477 GiB 476 GiB 264 KiB 1.3 GiB 1.4 TiB 25.60 0.98 31 up osd.41
42 hdd 1.81940 1.00000 1.8 TiB 514 GiB 513 GiB 35 KiB 1.2 GiB 1.3 TiB 27.61 1.05 29 up osd.42
43 hdd 1.81940 1.00000 1.8 TiB 358 GiB 356 GiB 111 KiB 1.2 GiB 1.5 TiB 19.19 0.73 24 up osd.43
TOTAL 80 TiB 21 TiB 21 TiB 32 MiB 69 GiB 59 TiB 26.23
MIN/MAX VAR: 0.12/2.36 STDDEV: 12.47

The number of objects in flight looks small. Your objects seem to have an average size of 4MB and should recover with full bandwidth. Check with top how much IO wait percentage you have on the OSD hosts.

iowait is 3.3% and load avg is 3.7, nothing crazy from what I can tell.

The one thing that jumps to my eye, though, is that you only have 22 dirty PGs and they are all recovering/backfilling already. I wonder if you have a problem with your crush rules; they might not do what you think they do. You said you increased the PG count for EC-22-Pool to 128 (from what?), but it doesn't really look like a suitable number of PGs has been marked for backfilling. Can you post the output of "ceph osd pool get EC-22-Pool all"?

From 32 to 128.

ceph osd pool get EC-22-Pool all
size: 4
min_size: 3
pg_num: 128
pgp_num: 48
crush_rule: EC-22-Pool
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: EC-22-Pro
fast_read: 0
pg_autoscale_mode: on
eio: false
bulk: false

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
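A side note on pgp_num lagging behind pg_num in the output above: recent releases step a pool towards its target gradually. A small sketch of how to see what is still pending and to stop the autoscaler from adjusting pg_num on its own (output fields vary by release):

ceph osd pool autoscale-status
ceph osd pool ls detail | grep EC-22-Pool   # on recent releases this also shows the pg_num/pgp_num targets
ceph osd pool set EC-22-Pool pg_autoscale_mode off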
________________________________________
From: Curt <lightspd@xxxxxxxxx>
Sent: 27 June 2022 19:41:06
To: Robert Gallop
Cc: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: Re: Ceph recovery network speed

I would love to see those types of speeds. I tried setting it all the way to 0 and nothing, I did that before I sent the first email, maybe it was your old post I got it from.

osd_recovery_sleep_hdd 0.000000 override (mon[0.000000])

On Mon, Jun 27, 2022 at 9:27 PM Robert Gallop <robert.gallop@xxxxxxxxx> wrote:

I saw a major boost after having the sleep_hdd set to 0. Only after that did I start staying at around 500MiB to 1.2GiB/sec and 1.5k obj/sec to 2.5k obj/sec. Eventually it tapered back down, but for me sleep was the key, and specifically in my case: osd_recovery_sleep_hdd

On Mon, Jun 27, 2022 at 11:17 AM Curt <lightspd@xxxxxxxxx> wrote:

On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder <frans@xxxxxx> wrote:

> I think this is just how ceph is. Maybe you should post the output of "ceph status", "ceph osd pool stats" and "ceph df" so that we can get an idea whether what you look at is expected or not. As I wrote before, object recovery is throttled and the recovery bandwidth depends heavily on object size. The interesting question is, how many objects per second are recovered/rebalanced

  data:
    pools:   11 pools, 369 pgs
    objects: 2.45M objects, 9.2 TiB
    usage:   20 TiB used, 60 TiB / 80 TiB avail
    pgs:     512136/9729081 objects misplaced (5.264%)
             343 active+clean
             22  active+remapped+backfilling

  io:
    client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
    recovery: 34 MiB/s, 8 objects/s

Pool 12 is the only one with any stats.

pool EC-22-Pool id 12
  510048/9545052 objects misplaced (5.344%)
  recovery io 36 MiB/s, 9 objects/s
  client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr

--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED    RAW USED  %RAW USED
hdd    80 TiB  60 TiB  20 TiB  20 TiB    25.45
TOTAL  80 TiB  60 TiB  20 TiB  20 TiB    25.45

--- POOLS ---
POOL                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                         1    1  152 MiB       38  457 MiB      0    9.2 TiB
21BadPool                    3   32    8 KiB        1   12 KiB      0     18 TiB
.rgw.root                    4   32  1.3 KiB        4   48 KiB      0    9.2 TiB
default.rgw.log              5   32  3.6 KiB      209  408 KiB      0    9.2 TiB
default.rgw.control          6   32      0 B        8      0 B      0    9.2 TiB
default.rgw.meta             7    8  6.7 KiB       20  203 KiB      0    9.2 TiB
rbd_rep_pool                 8   32  2.0 MiB        5  5.9 MiB      0    9.2 TiB
default.rgw.buckets.index    9    8  2.0 MiB       33  5.9 MiB      0    9.2 TiB
default.rgw.buckets.non-ec  10   32  1.4 KiB        0  4.3 KiB      0    9.2 TiB
default.rgw.buckets.data    11   32  232 GiB   61.02k  697 GiB   2.41    9.2 TiB
EC-22-Pool                  12  128  9.8 TiB    2.39M   20 TiB  41.55     14 TiB

> Maybe provide the output of the first two commands for osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each (wait a bit after setting these and then collect the output). Include the applied values for osd_max_backfills* and osd_recovery_max_active* for one of the OSDs in the pool (ceph config show osd.ID | grep -e osd_max_backfills -e osd_recovery_max_active).

I didn't notice any speed difference with sleep values changed, but I'll grab the stats between changes when I have a chance.
ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
osd_max_backfills            1000  override  mon[5]
osd_recovery_max_active      1000  override
osd_recovery_max_active_hdd  1000  override  mon[5]
osd_recovery_max_active_ssd  1000  override

> I don't really know if on such a small cluster one can expect more than what you see. It has nothing to do with network speed if you have a 10G line. However, recovery is something completely different from a full link-speed copy.
>
> I can tell you that boatloads of tiny objects are a huge pain for recovery, even on SSD. Ceph doesn't raid up sections of disks against each other, but object for object. This might be a feature request: that PG space allocation and recovery should follow the model of LVM extents (ideally match with LVM extents) to allow recovery/rebalancing larger chunks of storage in one go, containing parts of a large or many small objects.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 27 June 2022 17:35:19
> To: Frank Schilder
> Cc: ceph-users@xxxxxxx
> Subject: Re: Re: Ceph recovery network speed
>
> Hello,
>
> I had already increased/changed those variables previously. I increased the pg_num to 128, which increased the number of PGs backfilling, but speed is still only at 30 MiB/s avg and it has been backfilling 23 PGs for the last several hours. Should I increase it higher than 128?
>
> I'm still trying to figure out if this is just how ceph is or if there is a bottleneck somewhere. Like, if I sftp a 10G file between servers it's done in a couple of minutes or less. Am I thinking of this wrong?
>
> Thanks,
> Curt
>
> On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Curt,
>
> as far as I understood, a 2+2 EC pool is recovering, which makes 1 OSD per host busy. My experience is that the algorithm for selecting PGs to backfill/recover is not very smart. It could simply be that it doesn't find more PGs without violating some of these settings:
>
> osd_max_backfills
> osd_recovery_max_active
>
> I have never observed the second parameter to change anything (try anyway). However, the first one has a large impact. You could try increasing this slowly until recovery moves faster. Another parameter you might want to try is
>
> osd_recovery_sleep_[hdd|ssd]
>
> Be careful as this will impact client IO. I could reduce the sleep for my HDDs to 0.05. With your workload pattern, this might be something you can tune as well.
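A sketch of what adjusting those throttles can look like at runtime; the values are examples only, not recommendations for this cluster, and client IO should be watched while changing them:

ceph config set osd osd_max_backfills 2             # raise in small steps
ceph config set osd osd_recovery_max_active_hdd 3
ceph config set osd osd_recovery_sleep_hdd 0.05
# confirm what an OSD actually applies
ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_sleep'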
> Having said that, I think you should increase your PG count on the EC pool as soon as the cluster is healthy. You have only about 20 PGs per OSD, and large PGs will take unnecessarily long to recover. A higher PG count will also make it easier for the scheduler to find PGs for recovery/backfill. Aim for a number between 100 and 200. Give the pool(s) with the most data (#objects) the most PGs.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 24 June 2022 19:04
> To: Anthony D'Atri; ceph-users@xxxxxxx
> Subject: Re: Ceph recovery network speed
>
> 2 PGs shouldn't take hours to backfill in my opinion. Just 2TB enterprise HDDs.
>
> Take this log entry below: 72 minutes and still backfilling undersized? Should it be that slow?
>
> pg 12.15 is stuck undersized for 72m, current state active+undersized+degraded+remapped+backfilling, last acting [34,10,29,NONE]
>
> Thanks,
> Curt
>
> On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
>
> > Your recovery is slow *because* there are only 2 PGs backfilling.
> >
> > What kind of OSD media are you using?
> >
> > > On Jun 24, 2022, at 09:46, Curt <lightspd@xxxxxxxxx> wrote:
> > >
> > > Hello,
> > >
> > > I'm trying to understand why my recovery is so slow with only 2 PGs backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network. I have tested the speed between machines with a few tools and all confirm 10G speed. I've tried changing various settings of priority and recovery sleep hdd, but still the same. Is this a configuration issue or something else?
> > >
> > > It's just a small cluster right now with 4 hosts, 11 OSDs per host. Please let me know if you need more information.
> > >
> > > Thanks,
> > > Curt
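One way to verify how many PGs are actually moving at a given moment, sketched with the pool name from this thread (ceph pg ls accepts PG states as filters):

ceph pg ls backfilling           # PGs currently backfilling
ceph pg ls backfill_wait         # PGs queued behind the backfill limits
ceph osd pool stats EC-22-Pool   # per-pool recovery and client IO rates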
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx