On Mon, Jun 27, 2022 at 11:08 PM Frank Schilder <frans@xxxxxx> wrote:

> Do you, by any chance, have SMR drives? This may not be stated on the
> drive; check what the internet has to say. I also would have liked to see
> the beginning of the ceph status: number of hosts, number of OSDs, up and
> down, whatever. Can you also send the result of ceph osd df tree?
>

As far as I can tell none of the drives are SMR drives. I did have some
inconsistent PGs pop up; scrubs are still running.

  cluster:
    id:     1684fe88-aae0-11ec-9593-df430e3982a0
    health: HEALTH_ERR
            10 scrub errors
            Possible data damage: 4 pgs inconsistent

  services:
    mon: 5 daemons, quorum cephmgr,cephmon1,cephmon2,cephmon3,cephmgr2 (age 8w)
    mgr: cephmon1.fxtvtu(active, since 2d), standbys: cephmon2.wrzwwn, cephmgr2.hzsrdo, cephmgr.bazebq
    osd: 44 osds: 44 up (since 3d), 44 in (since 3d); 28 remapped pgs
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    pools:   11 pools, 369 pgs
    objects: 2.45M objects, 9.2 TiB
    usage:   21 TiB used, 59 TiB / 80 TiB avail
    pgs:     503944/9729081 objects misplaced (5.180%)
             337 active+clean
             28  active+remapped+backfilling
             4   active+clean+inconsistent

  io:
    client:   1000 KiB/s rd, 717 KiB/s wr, 81 op/s rd, 57 op/s wr
    recovery: 34 MiB/s, 8 objects/s

ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE   DATA      OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         80.05347         -   80 TiB    21 TiB    21 TiB   32 MiB   69 GiB   59 TiB  26.23  1.00    -          root default
-5         20.01337         -   20 TiB   5.3 TiB   5.3 TiB  1.4 MiB   19 GiB   15 TiB  26.47  1.01    -          host hyperion01
 1    hdd   1.81940   1.00000  1.8 TiB   749 GiB   747 GiB  224 KiB  2.2 GiB  1.1 TiB  40.19  1.53   36      up  osd.1
 3    hdd   1.81940   1.00000  1.8 TiB   531 GiB   530 GiB    3 KiB  1.9 GiB  1.3 TiB  28.52  1.09   31      up  osd.3
 5    hdd   1.81940   1.00000  1.8 TiB   167 GiB   166 GiB   36 KiB  1.2 GiB  1.7 TiB   8.98  0.34   18      up  osd.5
 7    hdd   1.81940   1.00000  1.8 TiB   318 GiB   316 GiB   83 KiB  1.2 GiB  1.5 TiB  17.04  0.65   26      up  osd.7
 9    hdd   1.81940   1.00000  1.8 TiB  1017 GiB  1014 GiB  139 KiB  2.6 GiB  846 GiB  54.59  2.08   38      up  osd.9
11    hdd   1.81940   1.00000  1.8 TiB   569 GiB   567 GiB    4 KiB  2.1 GiB  1.3 TiB  30.56  1.17   29      up  osd.11
13    hdd   1.81940   1.00000  1.8 TiB   293 GiB   291 GiB  338 KiB  1.5 GiB  1.5 TiB  15.72  0.60   23      up  osd.13
15    hdd   1.81940   1.00000  1.8 TiB   368 GiB   366 GiB  641 KiB  1.6 GiB  1.5 TiB  19.74  0.75   23      up  osd.15
17    hdd   1.81940   1.00000  1.8 TiB   369 GiB   367 GiB    2 KiB  1.5 GiB  1.5 TiB  19.80  0.75   26      up  osd.17
19    hdd   1.81940   1.00000  1.8 TiB   404 GiB   403 GiB    7 KiB  1.1 GiB  1.4 TiB  21.69  0.83   31      up  osd.19
45    hdd   1.81940   1.00000  1.8 TiB   639 GiB   637 GiB    2 KiB  2.0 GiB  1.2 TiB  34.30  1.31   32      up  osd.45
-3         20.01337         -   20 TiB   5.2 TiB   5.2 TiB  2.0 MiB   18 GiB   15 TiB  26.15  1.00    -          host hyperion02
 0    hdd   1.81940   1.00000  1.8 TiB   606 GiB   604 GiB  302 KiB  2.0 GiB  1.2 TiB  32.52  1.24   33      up  osd.0
 2    hdd   1.81940   1.00000  1.8 TiB    58 GiB    58 GiB  112 KiB  249 MiB  1.8 TiB   3.14  0.12   14      up  osd.2
 4    hdd   1.81940   1.00000  1.8 TiB   254 GiB   252 GiB   14 KiB  1.6 GiB  1.6 TiB  13.63  0.52   28      up  osd.4
 6    hdd   1.81940   1.00000  1.8 TiB   574 GiB   572 GiB    1 KiB  1.8 GiB  1.3 TiB  30.81  1.17   26      up  osd.6
 8    hdd   1.81940   1.00000  1.8 TiB   201 GiB   200 GiB  618 KiB  743 MiB  1.6 TiB  10.77  0.41   23      up  osd.8
10    hdd   1.81940   1.00000  1.8 TiB   628 GiB   626 GiB    4 KiB  2.2 GiB  1.2 TiB  33.72  1.29   37      up  osd.10
12    hdd   1.81940   1.00000  1.8 TiB   355 GiB   353 GiB  361 KiB  1.2 GiB  1.5 TiB  19.03  0.73   30      up  osd.12
14    hdd   1.81940   1.00000  1.8 TiB   1.1 TiB   1.1 TiB    1 KiB  2.7 GiB  708 GiB  62.00  2.36   38      up  osd.14
16    hdd   1.81940   1.00000  1.8 TiB   240 GiB   239 GiB    4 KiB  1.2 GiB  1.6 TiB  12.90  0.49   20      up  osd.16
18    hdd   1.81940   1.00000  1.8 TiB   300 GiB   298 GiB  542 KiB  1.6 GiB  1.5 TiB  16.08  0.61   21      up  osd.18
32    hdd   1.81940   1.00000  1.8 TiB   989 GiB   986 GiB   45 KiB  2.7 GiB  874 GiB  53.09  2.02   36      up  osd.32
-7         20.01337         -   20 TiB   5.2 TiB   5.2 TiB  2.9 MiB   17 GiB   15 TiB  26.06  0.99    -          host hyperion03
22    hdd   1.81940   1.00000  1.8 TiB   449 GiB   448 GiB  443 KiB  1.5 GiB  1.4 TiB  24.10  0.92   31      up  osd.22
23    hdd   1.81940   1.00000  1.8 TiB   299 GiB   298 GiB    5 KiB  1.4 GiB  1.5 TiB  16.05  0.61   26      up  osd.23
24    hdd   1.81940   1.00000  1.8 TiB   735 GiB   733 GiB    8 KiB  2.3 GiB  1.1 TiB  39.45  1.50   33      up  osd.24
25    hdd   1.81940   1.00000  1.8 TiB   519 GiB   517 GiB    5 KiB  1.4 GiB  1.3 TiB  27.85  1.06   26      up  osd.25
26    hdd   1.81940   1.00000  1.8 TiB   483 GiB   481 GiB  614 KiB  1.7 GiB  1.3 TiB  25.94  0.99   28      up  osd.26
27    hdd   1.81940   1.00000  1.8 TiB   226 GiB   225 GiB  1.5 MiB  1.0 GiB  1.6 TiB  12.11  0.46   17      up  osd.27
28    hdd   1.81940   1.00000  1.8 TiB   443 GiB   441 GiB   24 KiB  1.5 GiB  1.4 TiB  23.76  0.91   21      up  osd.28
29    hdd   1.81940   1.00000  1.8 TiB   801 GiB   799 GiB    7 KiB  2.2 GiB  1.0 TiB  42.98  1.64   31      up  osd.29
30    hdd   1.81940   1.00000  1.8 TiB   523 GiB   522 GiB  174 KiB  1.2 GiB  1.3 TiB  28.09  1.07   29      up  osd.30
31    hdd   1.81940   1.00000  1.8 TiB   322 GiB   321 GiB    4 KiB  1.2 GiB  1.5 TiB  17.30  0.66   26      up  osd.31
44    hdd   1.81940   1.00000  1.8 TiB   541 GiB   540 GiB  136 KiB  1.4 GiB  1.3 TiB  29.06  1.11   24      up  osd.44
-9         20.01337         -   20 TiB   5.3 TiB   5.2 TiB   25 MiB   16 GiB   15 TiB  26.25  1.00    -          host hyperion04
33    hdd   1.81940   1.00000  1.8 TiB   466 GiB   465 GiB  469 KiB  1.4 GiB  1.4 TiB  25.02  0.95   28      up  osd.33
34    hdd   1.81940   1.00000  1.8 TiB   508 GiB   506 GiB    2 KiB  1.8 GiB  1.3 TiB  27.28  1.04   30      up  osd.34
35    hdd   1.81940   1.00000  1.8 TiB   521 GiB   520 GiB    2 KiB  1.4 GiB  1.3 TiB  27.98  1.07   32      up  osd.35
36    hdd   1.81940   1.00000  1.8 TiB   872 GiB   870 GiB    3 KiB  2.3 GiB  991 GiB  46.81  1.78   40      up  osd.36
37    hdd   1.81940   1.00000  1.8 TiB   443 GiB   441 GiB  136 KiB  1.2 GiB  1.4 TiB  23.75  0.91   25      up  osd.37
38    hdd   1.81940   1.00000  1.8 TiB   138 GiB   137 GiB   24 MiB  647 MiB  1.7 TiB   7.40  0.28   27      up  osd.38
39    hdd   1.81940   1.00000  1.8 TiB   638 GiB   637 GiB  622 KiB  1.7 GiB  1.2 TiB  34.26  1.31   33      up  osd.39
40    hdd   1.81940   1.00000  1.8 TiB   444 GiB   443 GiB   14 KiB  1.4 GiB  1.4 TiB  23.85  0.91   25      up  osd.40
41    hdd   1.81940   1.00000  1.8 TiB   477 GiB   476 GiB  264 KiB  1.3 GiB  1.4 TiB  25.60  0.98   31      up  osd.41
42    hdd   1.81940   1.00000  1.8 TiB   514 GiB   513 GiB   35 KiB  1.2 GiB  1.3 TiB  27.61  1.05   29      up  osd.42
43    hdd   1.81940   1.00000  1.8 TiB   358 GiB   356 GiB  111 KiB  1.2 GiB  1.5 TiB  19.19  0.73   24      up  osd.43
                         TOTAL   80 TiB    21 TiB    21 TiB   32 MiB   69 GiB   59 TiB  26.23
MIN/MAX VAR: 0.12/2.36  STDDEV: 12.47

> The number of objects in flight looks small. Your objects seem to have an
> average size of 4MB and should recover with full bandwidth. Check with top
> how much IO wait percentage you have on the OSD hosts.
>

iowait is 3.3% and load avg is 3.7, nothing crazy from what I can tell.

> The one thing that jumps to my eye, though, is that you only have 22 dirty
> PGs and they are all recovering/backfilling already. I wonder if you have a
> problem with your crush rules; they might not do what you think they do.
> You said you increased the PG count for EC-22-Pool to 128 (from what?) but
> it doesn't really look like a suitable number of PGs has been marked for
> backfilling. Can you post the output of "ceph osd pool get EC-22-Pool all"?
>

From 32 to 128.

ceph osd pool get EC-22-Pool all
size: 4
min_size: 3
pg_num: 128
pgp_num: 48
crush_rule: EC-22-Pool
hashpspool: true
allow_ec_overwrites: true
nodelete: false
nopgchange: false
nosizechange: false
write_fadvise_dontneed: false
noscrub: false
nodeep-scrub: false
use_gmt_hitset: 1
erasure_code_profile: EC-22-Pro
fast_read: 0
pg_autoscale_mode: on
eio: false
bulk: false
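
For reference, the pg_num change discussed above maps to commands along these
lines. This is only a sketch using the pool name from this thread; on recent
releases the mgr ramps pgp_num up toward pg_num on its own, which would explain
the pgp_num: 48 still shown in the output:

    ceph osd pool set EC-22-Pool pg_num 128
    # only needed if pgp_num is managed by hand; otherwise it catches up automatically
    ceph osd pool set EC-22-Pool pgp_num 128
    ceph osd pool get EC-22-Pool pg_num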
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Curt <lightspd@xxxxxxxxx>
> Sent: 27 June 2022 19:41:06
> To: Robert Gallop
> Cc: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re: Re: Ceph recovery network speed
>
> I would love to see those types of speeds. I tried setting it all the way
> to 0 and nothing; I did that before I sent the first email, maybe it was
> your old post I got it from.
>
> osd_recovery_sleep_hdd    0.000000    override    (mon[0.000000])
>
> On Mon, Jun 27, 2022 at 9:27 PM Robert Gallop <robert.gallop@xxxxxxxxx> wrote:
> I saw a major boost after having the sleep_hdd set to 0. Only after that
> did I start staying at around 500MiB to 1.2GiB/sec and 1.5k obj/sec to
> 2.5k obj/sec.
>
> Eventually it tapered back down, but for me sleep was the key, and
> specifically in my case:
>
> osd_recovery_sleep_hdd
>
> On Mon, Jun 27, 2022 at 11:17 AM Curt <lightspd@xxxxxxxxx> wrote:
> On Mon, Jun 27, 2022 at 8:52 PM Frank Schilder <frans@xxxxxx> wrote:
>
> > I think this is just how ceph is. Maybe you should post the output of
> > "ceph status", "ceph osd pool stats" and "ceph df" so that we can get an
> > idea whether what you look at is expected or not. As I wrote before,
> > object recovery is throttled and the recovery bandwidth depends heavily
> > on object size. The interesting question is how many objects per second
> > are recovered/rebalanced.
>
> data:
>   pools:   11 pools, 369 pgs
>   objects: 2.45M objects, 9.2 TiB
>   usage:   20 TiB used, 60 TiB / 80 TiB avail
>   pgs:     512136/9729081 objects misplaced (5.264%)
>            343 active+clean
>            22  active+remapped+backfilling
>
> io:
>   client:   2.0 MiB/s rd, 344 KiB/s wr, 142 op/s rd, 69 op/s wr
>   recovery: 34 MiB/s, 8 objects/s
>
> Pool 12 is the only one with any stats.
>
> pool EC-22-Pool id 12
>   510048/9545052 objects misplaced (5.344%)
>   recovery io 36 MiB/s, 9 objects/s
>   client io 1.8 MiB/s rd, 404 KiB/s wr, 86 op/s rd, 72 op/s wr
>
> --- RAW STORAGE ---
> CLASS    SIZE    AVAIL    USED  RAW USED  %RAW USED
> hdd    80 TiB   60 TiB  20 TiB    20 TiB      25.45
> TOTAL  80 TiB   60 TiB  20 TiB    20 TiB      25.45
>
> --- POOLS ---
> POOL                        ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
> .mgr                         1    1  152 MiB       38  457 MiB      0    9.2 TiB
> 21BadPool                    3   32    8 KiB        1   12 KiB      0     18 TiB
> .rgw.root                    4   32  1.3 KiB        4   48 KiB      0    9.2 TiB
> default.rgw.log              5   32  3.6 KiB      209  408 KiB      0    9.2 TiB
> default.rgw.control          6   32      0 B        8      0 B      0    9.2 TiB
> default.rgw.meta             7    8  6.7 KiB       20  203 KiB      0    9.2 TiB
> rbd_rep_pool                 8   32  2.0 MiB        5  5.9 MiB      0    9.2 TiB
> default.rgw.buckets.index    9    8  2.0 MiB       33  5.9 MiB      0    9.2 TiB
> default.rgw.buckets.non-ec  10   32  1.4 KiB        0  4.3 KiB      0    9.2 TiB
> default.rgw.buckets.data    11   32  232 GiB   61.02k  697 GiB   2.41    9.2 TiB
> EC-22-Pool                  12  128  9.8 TiB    2.39M   20 TiB  41.55     14 TiB
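
For readers who want to run the experiment requested below: these throttles can
be changed at runtime with ceph config set and read back with ceph config show.
The values here are placeholders for illustration, not a recommendation for
this cluster:

    ceph config set osd osd_recovery_sleep_hdd 0.05
    ceph config set osd osd_max_backfills 2
    ceph config show osd.19 | grep -e osd_recovery_sleep -e osd_max_backfills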
> > Maybe provide the output of the first two commands for
> > osd_recovery_sleep_hdd=0.05 and osd_recovery_sleep_hdd=0.1 each (wait a
> > bit after setting these and then collect the output). Include the applied
> > values for osd_max_backfills* and osd_recovery_max_active* for one of the
> > OSDs in the pool (ceph config show osd.ID | grep -e osd_max_backfills -e
> > osd_recovery_max_active).
>
> I didn't notice any speed difference with the sleep values changed, but I'll
> grab the stats between changes when I have a chance.
>
> ceph config show osd.19 | egrep 'osd_max_backfills|osd_recovery_max_active'
> osd_max_backfills              1000    override    mon[5]
> osd_recovery_max_active        1000    override
> osd_recovery_max_active_hdd    1000    override    mon[5]
> osd_recovery_max_active_ssd    1000    override
>
> > I don't really know if on such a small cluster one can expect more than
> > what you see. It has nothing to do with network speed if you have a 10G
> > line. However, recovery is something completely different from a full
> > link-speed copy.
> >
> > I can tell you that boatloads of tiny objects are a huge pain for
> > recovery, even on SSD. Ceph doesn't raid up sections of disks against
> > each other, but object for object. This might be a feature request: that
> > PG space allocation and recovery should follow the model of LVM extents
> > (ideally match with LVM extents) to allow recovery/rebalancing larger
> > chunks of storage in one go, containing parts of a large or many small
> > objects.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Curt <lightspd@xxxxxxxxx>
> > Sent: 27 June 2022 17:35:19
> > To: Frank Schilder
> > Cc: ceph-users@xxxxxxx
> > Subject: Re: Re: Ceph recovery network speed
> >
> > Hello,
> >
> > I had already increased/changed those variables previously. I increased
> > the pg_num to 128, which increased the number of PGs backfilling, but
> > speed is still only at 30 MiB/s avg and it has been backfilling 23 PGs
> > for the last several hours. Should I increase it higher than 128?
> >
> > I'm still trying to figure out if this is just how ceph is or if there is
> > a bottleneck somewhere. If I sftp a 10G file between these servers, it's
> > done in a couple of minutes or less. Am I thinking of this wrong?
> >
> > Thanks,
> > Curt
> >
> > On Mon, Jun 27, 2022 at 12:33 PM Frank Schilder <frans@xxxxxx> wrote:
> > Hi Curt,
> >
> > as far as I understood, a 2+2 EC pool is recovering, which makes 1 OSD
> > per host busy. My experience is that the algorithm for selecting PGs to
> > backfill/recover is not very smart. It could simply be that it doesn't
> > find more PGs without violating some of these settings:
> >
> > osd_max_backfills
> > osd_recovery_max_active
> >
> > I have never observed the second parameter to change anything (try
> > anyway). However, the first one has a large impact. You could try
> > increasing this slowly until recovery moves faster. Another parameter
> > you might want to try is
> >
> > osd_recovery_sleep_[hdd|ssd]
> >
> > Be careful as this will impact client IO. I could reduce the sleep for
> > my HDDs to 0.05. With your workload pattern, this might be something you
> > can tune as well.
> >
> > Having said that, I think you should increase your PG count on the EC
> > pool as soon as the cluster is healthy. You have only about 20 PGs per
> > OSD, and large PGs will take unnecessarily long to recover. A higher PG
> > count will also make it easier for the scheduler to find PGs for
> > recovery/backfill. Aim for a number between 100 and 200. Give the
> > pool(s) with most data (#objects) the most PGs.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
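
A back-of-the-envelope sketch of the 100-200 PGs-per-OSD target above, using
the numbers from this thread (an illustration, not a sizing recommendation):
a 2+2 EC pool places 4 PG shards per PG and this cluster has 44 OSDs, so

    PG shards per OSD from this pool  ~  pg_num * 4 / 44
    pg_num = 128   ->  128 * 4 / 44   ~  12 shards per OSD
    target ~100    ->  100 * 44 / 4   =  1100, nearest power of two is 1024

The usual guidance is to keep pg_num at a power of two and to give the pool
holding most of the data most of the PGs, which matches the advice quoted
above.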
> > ________________________________________
> > From: Curt <lightspd@xxxxxxxxx>
> > Sent: 24 June 2022 19:04
> > To: Anthony D'Atri; ceph-users@xxxxxxx
> > Subject: Re: Ceph recovery network speed
> >
> > 2 PGs shouldn't take hours to backfill in my opinion. Just 2TB enterprise
> > HDs.
> >
> > Take this log entry below: 72 minutes and still backfilling undersized?
> > Should it be that slow?
> >
> > pg 12.15 is stuck undersized for 72m, current state
> > active+undersized+degraded+remapped+backfilling, last acting
> > [34,10,29,NONE]
> >
> > Thanks,
> > Curt
> >
> > On Fri, Jun 24, 2022 at 8:53 PM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> >
> > > Your recovery is slow *because* there are only 2 PGs backfilling.
> > >
> > > What kind of OSD media are you using?
> > >
> > > > On Jun 24, 2022, at 09:46, Curt <lightspd@xxxxxxxxx> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I'm trying to understand why my recovery is so slow with only 2 PGs
> > > > backfilling. I'm only getting speeds of 3-4 MiB/s on a 10G network. I
> > > > have tested the speed between machines with a few tools and all
> > > > confirm 10G speed. I've tried changing various settings of priority
> > > > and recovery sleep hdd, but still the same. Is this a configuration
> > > > issue or something else?
> > > >
> > > > It's just a small cluster right now with 4 hosts, 11 OSDs per host.
> > > > Please let me know if you need more information.
> > > >
> > > > Thanks,
> > > > Curt
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx