On Wed, Jun 29, 2022 at 1:06 PM Frank Schilder <frans@xxxxxx> wrote:
> Hi,
>
> did you wait for PG creation and peering to finish after setting pg_num and pgp_num? They should be right on the value you set and not lower.

Yes, the only thing going on was backfill. It's still just slowly expanding pg and pgp nums. I even ran the set command again. Here's the current info:

ceph osd pool get EC-22-Pool all
size: 4
min_size: 3
pg_num: 226
pgp_num: 98
crush_rule: EC-22-Pool
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: EC-22-Pro
fast_read: 0
pg_autoscale_mode: off
eio: false
bulk: false

> > How do you set the upmap balancer per pool?
>
> I'm afraid the answer is RTFM. I don't use it, but I believe I remember one could configure it for equi-distribution of PGs for each pool.

Ok, I'll dig around some more. I glanced at the balancer page and didn't see it.

> Whenever you grow the cluster, you should make the same considerations again and select the number of PGs per pool depending on number of objects, capacity and performance.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 28 June 2022 16:33:24
> To: Frank Schilder
> Cc: Robert Gallop; ceph-users@xxxxxxx
> Subject: Re: Re: Ceph recovery network speed
>
> Hi Frank,
>
> Thank you for the thorough breakdown. I have increased the pg_num and pgp_num to 1024 to start on the ec-22 pool. That is going to be my primary pool with the most data. It looks like ceph slowly scales the pg up even with autoscaling off, since I see target_pg_num 2048, pg_num 199.
>
> root@cephmgr:/# ceph osd pool set EC-22-Pool pg_num 2048
> set pool 12 pg_num to 2048
> root@cephmgr:/# ceph osd pool set EC-22-Pool pgp_num 2048
> set pool 12 pgp_num to 2048
> root@cephmgr:/# ceph osd pool get EC-22-Pool all
> size: 4
> min_size: 3
> pg_num: 199
> pgp_num: 71
> crush_rule: EC-22-Pool
> hashpspool: true
> allow_ec_overwrites: true
> nodelete: false
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> use_gmt_hitset: 1
> erasure_code_profile: EC-22-Pro
> fast_read: 0
> pg_autoscale_mode: off
> eio: false
> bulk: false
>
> This cluster will be growing quite a bit over the next few months. I am migrating data from their old Giant cluster to a new one; by the time I'm done it should be 16 hosts with about 400TB of data. I'm guessing I'll have to increase the PG count again later when I start adding more servers to the cluster.
>
> I will look into whether SSDs are an option. How do you set the upmap balancer per pool? Looking at ceph balancer status, my mode is already upmap.
>
> Thanks again,
> Curt
>
> On Tue, Jun 28, 2022 at 1:23 AM Frank Schilder <frans@xxxxxx> wrote:
> Hi Curt,
>
> looking at what you sent here, I believe you are the victim of "the law of large numbers really only holds for large numbers". In other words, the statistics of small samples is biting you. The PG numbers of your pools are so low that they lead to a very large imbalance of data- and IO placement. In other words, in your cluster a few OSDs receive the majority of IO requests and bottleneck the entire cluster.
>
> If I see this correctly, the PG num per drive varies from 14 to 40. That's an insane imbalance.
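For the per-pool balancing question, one approach that should work on recent releases is to drive the upmap balancer by hand and give it a single pool. This is only a rough sketch (not tested on this cluster; "myplan" is just a placeholder plan name):

  ceph balancer status
  ceph balancer mode upmap
  ceph balancer off                         # pause background balancing while working with a manual plan
  ceph balancer eval EC-22-Pool             # score the current PG distribution of just this pool
  ceph balancer optimize myplan EC-22-Pool  # build a plan that only touches EC-22-Pool
  ceph balancer show myplan                 # review the proposed upmap items before applying
  ceph balancer execute myplan
  ceph balancer rm myplan

The automatic mode balances all pools together; restricting a manual plan to one pool is what gives the per-pool spread mentioned here.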
> Also, on your EC pool PG_num is 128 but PGP_num is only 48. The autoscaler is screwing it up for you. It will slowly increase the number of active PGs, causing continuous relocation of objects for a very long time.
>
> I think the recovery speed you see of 8 objects per second is not too bad considering that you have an HDD-only cluster. The speed does not increase because it is a small number of PGs sending data - a subset of the 32 you had before. In addition, due to the imbalance of PGs per OSD, only a small number of PGs will be able to send data. You will need patience to get out of this corner.
>
> The first thing I would do is look at which pools are important for your workload in the long run. I see 2 pools having a significant number of objects: EC-22-Pool and default.rgw.buckets.data. EC-22-Pool has about 40 times the number of objects and bytes as default.rgw.buckets.data. I would scale both up in PG count with emphasis on EC-22-Pool.
>
> Your cluster can safely operate between 1100 and 2200 PGs with replication <=4. If you don't plan to create more large pools, a good choice of distributing this capacity might be
>
> EC-22-Pool: 1024 PGs (could be pushed up to 2048)
> default.rgw.buckets.data: 256 PGs
>
> That's towards the lower end of available PGs. Please make your own calculation and judgement.
>
> If you have settled on target numbers, change the pool sizes in one go, that is, set PG_num and PGP_num to the same value right away. You might need to turn the autoscaler off for these 2 pools. The rebalancing will take a long time and also not speed up, because the few sending PGs are the bottleneck, not the receiving ones. You will have to sit it out.
>
> The goal is that, in the future, recovery and re-balancing are improved. In my experience, a reasonably high PG count will also reduce latency of client IO.
>
> The next thing to look at is the distribution of PGs per OSD. This has an enormous performance impact, because a few too busy OSDs can throttle an entire cluster (it's always the slowest disk that wins). I use the very simple reweight-by-utilization method, but my pools do not share OSDs as yours do. You might want to try the upmap balancer per pool to get PGs per pool evenly spread out over OSDs.
>
> Lastly, if you can afford it and your hosts have a slot left, consider buying one enterprise SSD per host for the meta-data pools to get this IO away from the HDDs. If you buy a bunch of 128G or 256G SATA SSDs, you can probably place everything except the EC-22-Pool on these drives, separating completely.
>
> Hope that helps and maybe someone else has ideas as well?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 27 June 2022 21:36:27
> To: Frank Schilder
> Cc: Robert Gallop; ceph-users@xxxxxxx
> Subject: Re: Re: Ceph recovery network speed
>
> On Mon, Jun 27, 2022 at 11:08 PM Frank Schilder <frans@xxxxxx> wrote:
> Do you, by any chance, have SMR drives? This may not be stated on the drive; check what the internet has to say. I also would have liked to see the beginning of the ceph status, number of hosts, number of OSDs, up and down, whatever. Can you also send the result of ceph osd df tree?
>
> As far as I can tell, none of the drives are SMR drives.
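For the SMR question, host-aware and host-managed SMR drives can be spotted from the kernel without trusting the label; drive-managed SMR unfortunately presents itself as a normal disk, so the vendor datasheet is still the final word. A quick check on each OSD host might look like this (a sketch, assuming the usual sysfs layout on a reasonably recent kernel):

  # prints "none" for conventional (or drive-managed SMR) disks,
  # "host-aware" or "host-managed" for SMR disks the kernel can see
  for d in /sys/block/sd*; do echo "$(basename $d): $(cat $d/queue/zoned)"; done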
> I did have some inconsistent pop up, scrubs are still running. > > cluster: > id: 1684fe88-aae0-11ec-9593-df430e3982a0 > health: HEALTH_ERR > 10 scrub errors > Possible data damage: 4 pgs inconsistent > > services: > mon: 5 daemons, quorum cephmgr,cephmon1,cephmon2,cephmon3,cephmgr2 > (age 8w) > mgr: cephmon1.fxtvtu(active, since 2d), standbys: cephmon2.wrzwwn, > cephmgr2.hzsrdo, cephmgr.bazebq > osd: 44 osds: 44 up (since 3d), 44 in (since 3d); 28 remapped pgs > rgw: 2 daemons active (2 hosts, 1 zones) > > data: > pools: 11 pools, 369 pgs > objects: 2.45M objects, 9.2 TiB > usage: 21 TiB used, 59 TiB / 80 TiB avail > pgs: 503944/9729081 objects misplaced (5.180%) > 337 active+clean > 28 active+remapped+backfilling > 4 active+clean+inconsistent > > io: > client: 1000 KiB/s rd, 717 KiB/s wr, 81 op/s rd, 57 op/s wr > recovery: 34 MiB/s, 8 objects/s > > ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META > AVAIL %USE VAR PGS STATUS TYPE NAME > -1 80.05347 - 80 TiB 21 TiB 21 TiB 32 MiB 69 > GiB 59 TiB 26.23 1.00 - root default > -5 20.01337 - 20 TiB 5.3 TiB 5.3 TiB 1.4 MiB 19 > GiB 15 TiB 26.47 1.01 - host hyperion01 > 1 hdd 1.81940 1.00000 1.8 TiB 749 GiB 747 GiB 224 KiB 2.2 > GiB 1.1 TiB 40.19 1.53 36 up osd.1 > 3 hdd 1.81940 1.00000 1.8 TiB 531 GiB 530 GiB 3 KiB 1.9 > GiB 1.3 TiB 28.52 1.09 31 up osd.3 > 5 hdd 1.81940 1.00000 1.8 TiB 167 GiB 166 GiB 36 KiB 1.2 > GiB 1.7 TiB 8.98 0.34 18 up osd.5 > 7 hdd 1.81940 1.00000 1.8 TiB 318 GiB 316 GiB 83 KiB 1.2 > GiB 1.5 TiB 17.04 0.65 26 up osd.7 > 9 hdd 1.81940 1.00000 1.8 TiB 1017 GiB 1014 GiB 139 KiB 2.6 > GiB 846 GiB 54.59 2.08 38 up osd.9 > 11 hdd 1.81940 1.00000 1.8 TiB 569 GiB 567 GiB 4 KiB 2.1 > GiB 1.3 TiB 30.56 1.17 29 up osd.11 > 13 hdd 1.81940 1.00000 1.8 TiB 293 GiB 291 GiB 338 KiB 1.5 > GiB 1.5 TiB 15.72 0.60 23 up osd.13 > 15 hdd 1.81940 1.00000 1.8 TiB 368 GiB 366 GiB 641 KiB 1.6 > GiB 1.5 TiB 19.74 0.75 23 up osd.15 > 17 hdd 1.81940 1.00000 1.8 TiB 369 GiB 367 GiB 2 KiB 1.5 > GiB 1.5 TiB 19.80 0.75 26 up osd.17 > 19 hdd 1.81940 1.00000 1.8 TiB 404 GiB 403 GiB 7 KiB 1.1 > GiB 1.4 TiB 21.69 0.83 31 up osd.19 > 45 hdd 1.81940 1.00000 1.8 TiB 639 GiB 637 GiB 2 KiB 2.0 > GiB 1.2 TiB 34.30 1.31 32 up osd.45 > -3 20.01337 - 20 TiB 5.2 TiB 5.2 TiB 2.0 MiB 18 > GiB 15 TiB 26.15 1.00 - host hyperion02 > 0 hdd 1.81940 1.00000 1.8 TiB 606 GiB 604 GiB 302 KiB 2.0 > GiB 1.2 TiB 32.52 1.24 33 up osd.0 > 2 hdd 1.81940 1.00000 1.8 TiB 58 GiB 58 GiB 112 KiB 249 > MiB 1.8 TiB 3.14 0.12 14 up osd.2 > 4 hdd 1.81940 1.00000 1.8 TiB 254 GiB 252 GiB 14 KiB 1.6 > GiB 1.6 TiB 13.63 0.52 28 up osd.4 > 6 hdd 1.81940 1.00000 1.8 TiB 574 GiB 572 GiB 1 KiB 1.8 > GiB 1.3 TiB 30.81 1.17 26 up osd.6 > 8 hdd 1.81940 1.00000 1.8 TiB 201 GiB 200 GiB 618 KiB 743 > MiB 1.6 TiB 10.77 0.41 23 up osd.8 > 10 hdd 1.81940 1.00000 1.8 TiB 628 GiB 626 GiB 4 KiB 2.2 > GiB 1.2 TiB 33.72 1.29 37 up osd.10 > 12 hdd 1.81940 1.00000 1.8 TiB 355 GiB 353 GiB 361 KiB 1.2 > GiB 1.5 TiB 19.03 0.73 30 up osd.12 > 14 hdd 1.81940 1.00000 1.8 TiB 1.1 TiB 1.1 TiB 1 KiB 2.7 > GiB 708 GiB 62.00 2.36 38 up osd.14 > 16 hdd 1.81940 1.00000 1.8 TiB 240 GiB 239 GiB 4 KiB 1.2 > GiB 1.6 TiB 12.90 0.49 20 up osd.16 > 18 hdd 1.81940 1.00000 1.8 TiB 300 GiB 298 GiB 542 KiB 1.6 > GiB 1.5 TiB 16.08 0.61 21 up osd.18 > 32 hdd 1.81940 1.00000 1.8 TiB 989 GiB 986 GiB 45 KiB 2.7 > GiB 874 GiB 53.09 2.02 36 up osd.32 > -7 20.01337 - 20 TiB 5.2 TiB 5.2 TiB 2.9 MiB 17 > GiB 15 TiB 26.06 0.99 - host hyperion03 > 22 hdd 1.81940 1.00000 1.8 TiB 449 GiB 448 GiB 443 KiB 1.5 > GiB 1.4 TiB 24.10 0.92 31 up osd.22 > 
23 hdd 1.81940 1.00000 1.8 TiB 299 GiB 298 GiB 5 KiB 1.4 > GiB 1.5 TiB 16.05 0.61 26 up osd.23 > 24 hdd 1.81940 1.00000 1.8 TiB 735 GiB 733 GiB 8 KiB 2.3 > GiB 1.1 TiB 39.45 1.50 33 up osd.24 > 25 hdd 1.81940 1.00000 1.8 TiB 519 GiB 517 GiB 5 KiB 1.4 > GiB 1.3 TiB 27.85 1.06 26 up osd.25 > 26 hdd 1.81940 1.00000 1.8 TiB 483 GiB 481 GiB 614 KiB 1.7 > GiB 1.3 TiB 25.94 0.99 28 up osd.26 > 27 hdd 1.81940 1.00000 1.8 TiB 226 GiB 225 GiB 1.5 MiB 1.0 > GiB 1.6 TiB 12.11 0.46 17 up osd.27 > 28 hdd 1.81940 1.00000 1.8 TiB 443 GiB 441 GiB 24 KiB 1.5 > GiB 1.4 TiB 23.76 0.91 21 up osd.28 > 29 hdd 1.81940 1.00000 1.8 TiB 801 GiB 799 GiB 7 KiB 2.2 > GiB 1.0 TiB 42.98 1.64 31 up osd.29 > 30 hdd 1.81940 1.00000 1.8 TiB 523 GiB 522 GiB 174 KiB 1.2 > GiB 1.3 TiB 28.09 1.07 29 up osd.30 > 31 hdd 1.81940 1.00000 1.8 TiB 322 GiB 321 GiB 4 KiB 1.2 > GiB 1.5 TiB 17.30 0.66 26 up osd.31 > 44 hdd 1.81940 1.00000 1.8 TiB 541 GiB 540 GiB 136 KiB 1.4 > GiB 1.3 TiB 29.06 1.11 24 up osd.44 > -9 20.01337 - 20 TiB 5.3 TiB 5.2 TiB 25 MiB 16 > GiB 15 TiB 26.25 1.00 - host hyperion04 > 33 hdd 1.81940 1.00000 1.8 TiB 466 GiB 465 GiB 469 KiB 1.4 > GiB 1.4 TiB 25.02 0.95 28 up osd.33 > 34 hdd 1.81940 1.00000 1.8 TiB 508 GiB 506 GiB 2 KiB 1.8 > GiB 1.3 TiB 27.28 1.04 30 up osd.34 > 35 hdd 1.81940 1.00000 1.8 TiB 521 GiB 520 GiB 2 KiB 1.4 > GiB 1.3 TiB 27.98 1.07 32 up osd.35 > 36 hdd 1.81940 1.00000 1.8 TiB 872 GiB 870 GiB 3 KiB 2.3 > GiB 991 GiB 46.81 1.78 40 up osd.36 > 37 hdd 1.81940 1.00000 1.8 TiB 443 GiB 441 GiB 136 KiB 1.2 > GiB 1.4 TiB 23.75 0.91 25 up osd.37 > 38 hdd 1.81940 1.00000 1.8 TiB 138 GiB 137 GiB 24 MiB 647 > MiB 1.7 TiB 7.40 0.28 27 up osd.38 > 39 hdd 1.81940 1.00000 1.8 TiB 638 GiB 637 GiB 622 KiB 1.7 > GiB 1.2 TiB 34.26 1.31 33 up osd.39 > 40 hdd 1.81940 1.00000 1.8 TiB 444 GiB 443 GiB 14 KiB 1.4 > GiB 1.4 TiB 23.85 0.91 25 up osd.40 > 41 hdd 1.81940 1.00000 1.8 TiB 477 GiB 476 GiB 264 KiB 1.3 > GiB 1.4 TiB 25.60 0.98 31 up osd.41 > 42 hdd 1.81940 1.00000 1.8 TiB 514 GiB 513 GiB 35 KiB 1.2 > GiB 1.3 TiB 27.61 1.05 29 up osd.42 > 43 hdd 1.81940 1.00000 1.8 TiB 358 GiB 356 GiB 111 KiB 1.2 > GiB 1.5 TiB 19.19 0.73 24 up osd.43 > TOTAL 80 TiB 21 TiB 21 TiB 32 MiB 69 > GiB 59 TiB 26.23 > MIN/MAX VAR: 0.12/2.36 STDDEV: 12.47 > > The number of objects in flight looks small. Your objects seem to have an > average size of 4MB and should recover with full bandwidth. Check with top > how much IO wait percentage you have on the OSD hosts. > iowait is 3.3% and load avg is 3.7, nothing crazy from what I can tell. > > > The one thing that jumps to my eye though is, that you only have 22 dirty > PGs and they are all recovering/backfilling already. I wonder if you have a > problem with your crush rules, they might not do what you think they do. > You said you increased the PG count for EC-22-Pool to 128 (from what?) but > it doesn't really look like a suitable number of PGs has been marked for > backfilling. Can you post the output of "ceph osd pool get EC-22-Pool all"? 
> From 32 to 128
>
> ceph osd pool get EC-22-Pool all
> size: 4
> min_size: 3
> pg_num: 128
> pgp_num: 48
> crush_rule: EC-22-Pool
> hashpspool: true
> allow_ec_overwrites: true
> nodelete: false
> nopgchange: false
> nosizechange: false
> write_fadvise_dontneed: false
> noscrub: false
> nodeep-scrub: false
> use_gmt_hitset: 1
> erasure_code_profile: EC-22-Pro
> fast_read: 0
> pg_autoscale_mode: on
> eio: false
> bulk: false
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 27 June 2022 19:41:06
> To: Robert Gallop
> Cc: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re: Re: Ceph recovery network speed
>
> I would love to see those types of speeds. I tried setting it all the way to 0 and nothing; I did that before I sent the first email - maybe it was your old post I got it from.
>
> osd_recovery_sleep_hdd 0.000000 override (mon[0.000000])
>
> On Mon, Jun 27, 2022 at 9:27 PM Robert Gallop <robert.gallop@xxxxxxxxx> wrote:
> I saw a major boost after having the sleep_hdd set to 0. Only after that did I start staying at around 500MiB to 1.2GiB/sec and 1.5k obj/sec to 2.5k obj/sec.
>
> Eventually it tapered back down, but for me sleep was the key, and specifically in my case:
>
> osd_recovery_sleep_hdd
>
> On Mon, Jun 27, 2022 at 11:17 AM Curt <lightspd@xxxxxxxxx> wrote:
> On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder <frans@xxxxxx> wrote:
> > I think this is just how ceph is. Maybe you should post the output of "ceph status", "ceph osd pool stats" and "ceph df" so that we can get an idea whether what you look at is expected or not. As I wrote before, object recovery is throttled and the recovery bandwidth depends heavily on object size. The interesting question is how many objects per second are recovered/rebalanced.
>
> data:
>   pools:   11 pools, 369 pgs
>   objects: 2.45M objects, 9.2 TiB
>   usage:   20 TiB used, 60 TiB / 80 TiB avail
>   pgs:     512136/9729081 objects misplaced (5.264%)
>            343 active+clean
>            22 active+remapped+backfilling
>
> io:
>   client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
>   recovery: 34 MiB/s, 8 objects/s
>
> Pool 12 is the only one with any stats.
> pool EC-22-Pool id 12
>   510048/9545052 objects misplaced (5.344%)
>   recovery io 36 MiB/s, 9 objects/s
>   client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr
>
> --- RAW STORAGE ---
> CLASS  SIZE    AVAIL   USED    RAW USED  %RAW USED
> hdd    80 TiB  60 TiB  20 TiB  20 TiB    25.45
> TOTAL  80 TiB  60 TiB  20 TiB  20 TiB    25.45
>
> --- POOLS ---
> POOL                        ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
> .mgr                         1    1  152 MiB       38  457 MiB      0    9.2 TiB
> 21BadPool                    3   32    8 KiB        1   12 KiB      0     18 TiB
> .rgw.root                    4   32  1.3 KiB        4   48 KiB      0    9.2 TiB
> default.rgw.log              5   32  3.6 KiB      209  408 KiB      0    9.2 TiB
> default.rgw.control          6   32      0 B        8      0 B      0    9.2 TiB
> default.rgw.meta             7    8  6.7 KiB       20  203 KiB      0    9.2 TiB
> rbd_rep_pool                 8   32  2.0 MiB        5  5.9 MiB      0    9.2 TiB
> default.rgw.buckets.index    9    8  2.0 MiB       33  5.9 MiB      0    9.2 TiB
> default.rgw.buckets.non-ec  10   32  1.4 KiB        0  4.3 KiB      0    9.2 TiB
> default.rgw.buckets.data    11   32  232 GiB   61.02k  697 GiB   2.41    9.2 TiB
> EC-22-Pool                  12  128  9.8 TiB    2.39M   20 TiB  41.55     14 TiB
>
> > Maybe provide the output of the first two commands for osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each (wait a bit after setting these and then collect the output). Include the applied values for osd_max_backfills* and osd_recovery_max_active* for one of the OSDs in the pool (ceph config show osd.ID | grep -e osd_max_backfills -e osd_recovery_max_active).
>
> I didn't notice any speed difference with the sleep values changed, but I'll grab the stats between changes when I have a chance.
>
> ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
> osd_max_backfills            1000  override  mon[5]
> osd_recovery_max_active      1000  override
> osd_recovery_max_active_hdd  1000  override  mon[5]
> osd_recovery_max_active_ssd  1000  override
>
> > I don't really know if on such a small cluster one can expect more than what you see. It has nothing to do with network speed if you have a 10G line. However, recovery is something completely different from a full link-speed copy.
> >
> > I can tell you that boatloads of tiny objects are a huge pain for recovery, even on SSD. Ceph doesn't raid up sections of disks against each other, but object for object. This might be a feature request: that PG space allocation and recovery should follow the model of LVM extents (ideally match with LVM extents) to allow recovery/rebalancing larger chunks of storage in one go, containing parts of a large or many small objects.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Curt <lightspd@xxxxxxxxx>
> > Sent: 27 June 2022 17:35:19
> > To: Frank Schilder
> > Cc: ceph-users@xxxxxxx
> > Subject: Re: Re: Ceph recovery network speed
> >
> > Hello,
> >
> > I had already increased/changed those variables previously. I increased the pg_num to 128, which increased the number of PGs backfilling, but speed is still only at 30 MiB/s avg and it has been backfilling 23 PGs for the last several hours. Should I increase it higher than 128?
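For the comparison asked for above (the same stats at two different sleep values), something along these lines should collect it; just a sketch, re-using osd.19 only because it is the OSD already shown:

  ceph config set osd osd_recovery_sleep_hdd 0.05
  # give recovery a few minutes to settle, then take the first sample
  ceph status
  ceph osd pool stats EC-22-Pool
  ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active|osd_recovery_sleep'
  ceph config set osd osd_recovery_sleep_hdd 0.1
  # wait again and repeat the same three commands for the second sample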
> > I'm still trying to figure out if this is just how ceph is or if there is a bottleneck somewhere. Like, if I sftp a 10G file between servers it's done in a couple of minutes or less. Am I thinking of this wrong?
> >
> > Thanks,
> > Curt
> >
> > On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder <frans@xxxxxx> wrote:
> > Hi Curt,
> >
> > as far as I understood, a 2+2 EC pool is recovering, which makes 1 OSD per host busy. My experience is that the algorithm for selecting PGs to backfill/recover is not very smart. It could simply be that it doesn't find more PGs without violating some of these settings:
> >
> > osd_max_backfills
> > osd_recovery_max_active
> >
> > I have never observed the second parameter to change anything (try anyway). However, the first one has a large impact. You could try increasing this slowly until recovery moves faster. Another parameter you might want to try is
> >
> > osd_recovery_sleep_[hdd|ssd]
> >
> > Be careful as this will impact client IO. I could reduce the sleep for my HDDs to 0.05. With your workload pattern, this might be something you can tune as well.
> >
> > Having said that, I think you should increase your PG count on the EC pool as soon as the cluster is healthy. You have only about 20 PGs per OSD and large PGs will take unnecessarily long to recover. A higher PG count will also make it easier for the scheduler to find PGs for recovery/backfill. Aim for a number between 100 and 200. Give the pool(s) with the most data (#objects) the most PGs.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Curt <lightspd@xxxxxxxxx>
> > Sent: 24 June 2022 19:04
> > To: Anthony D'Atri; ceph-users@xxxxxxx
> > Subject: Re: Ceph recovery network speed
> >
> > 2 PGs shouldn't take hours to backfill, in my opinion. Just 2TB enterprise HDDs.
> >
> > Take this log entry below: 72 minutes and still backfilling undersized? Should it be that slow?
> > pg 12.15 is stuck undersized for 72m, current state active+undersized+degraded+remapped+backfilling, last acting [34,10,29,NONE]
> >
> > Thanks,
> > Curt
> >
> > On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> >
> > > Your recovery is slow *because* there are only 2 PGs backfilling.
> > >
> > > What kind of OSD media are you using?
> > >
> > > > On Jun 24, 2022, at 09:46, Curt <lightspd@xxxxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I'm trying to understand why my recovery is so slow with only 2 PGs backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network. I have tested the speed between machines with a few tools and all confirm 10G speed. I've tried changing various settings of priority and recovery sleep hdd, but still the same. Is this a configuration issue or something else?
> > > >
> > > > It's just a small cluster right now with 4 hosts, 11 OSDs per host. Please let me know if you need more information.
> > > > Thanks,
> > > > Curt
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx