> On Feb 26, 2025, at 9:07 AM, Deep Dish <deeepdish@xxxxxxxxx> wrote:
>
> I appreciate all the tips! And thanks for the observation on weights. I
> don't know how it got to 1 for all OSDs. The cluster has a mixture of 8
> and 10T drives. Is there a way to automatically readjust them, or is this
> done manually in the crush map (decompile/edit/compile)?
>
> I ran ceph osd crush reweight 75 1.0 and it started recovering right away,
> 3-4 Gbit/s sustained throughput. I know this is a bandaid; waiting on
> your guidance on how to adjust the weights above.
>
> Here is the requested additional output:
>
> # ceph -v
> ceph version 18.2.4 (..) reef (stable)
>
> NB: Once the cluster is stable and OK status, I plan to upgrade to 19.2.0
> via ceph orch.

19.2.1 is the latest. If you want to go to Squid, go to 19.2.1, or wait for 19.2.2.

> # ceph osd crush rule dump

Do you have any pools using rule 0?  Note that it does not specify a device class, but the others do.  Most likely you do have multiple pools using rule 0, and that’s confusing the balancer.

I would suggest decompiling the CRUSH map, changing rule 0’s take step to `step take default class hdd` (so its item_name becomes "default~hdd" like the other rules), and recompiling it,

OR

creating a new replicated CRUSH rule that specifies the hdd device class and changing your pools to use that instead of rule 0.

I suspect that would unblock your balancer.

> [
>     {
>         "rule_id": 0,
>         "rule_name": "replicated_rule",
>         "type": 1,
>         "steps": [
>             {
>                 "op": "take",
>                 "item": -1,
>                 "item_name": "default"
>             },
>             {
>                 "op": "chooseleaf_firstn",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     },
>     {
>         "rule_id": 1,
>         "rule_name": "fs01_data-ec",
>         "type": 3,
>         "steps": [
>             {
>                 "op": "set_chooseleaf_tries",
>                 "num": 5
>             },
>             {
>                 "op": "set_choose_tries",
>                 "num": 100
>             },
>             {
>                 "op": "take",
>                 "item": -2,
>                 "item_name": "default~hdd"
>             },
>             {
>                 "op": "chooseleaf_indep",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     },
>     {
>         "rule_id": 2,
>         "rule_name": "central.rgw.buckets.data",
>         "type": 3,
>         "steps": [
>             {
>                 "op": "set_chooseleaf_tries",
>                 "num": 5
>             },
>             {
>                 "op": "set_choose_tries",
>                 "num": 100
>             },
>             {
>                 "op": "take",
>                 "item": -2,
>                 "item_name": "default~hdd"
>             },
>             {
>                 "op": "chooseleaf_indep",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     }
> ]
>
> # ceph balancer status
>
> {
>     "active": true,
>     "last_optimize_duration": "0:00:00.000350",
>     "last_optimize_started": "Wed Feb 26 14:01:03 2025",
>     "mode": "upmap",
>     "no_optimization_needed": true,
>     "optimize_result": "Some objects (0.003469) are degraded; try again later",
>     "plans": []
> }
>
> On Wed, Feb 26, 2025 at 8:18 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
>> On Feb 26, 2025, at 7:47 AM, Deep Dish <deeepdish@xxxxxxxxx> wrote:
>>
>> Your parents had quite the sense of humor.
>>
>> Hello,
>>
>> I have an 80 OSD cluster (across 8 nodes). The average utilization across
>> my OSDs is ~ 32%.
>>
>> Average isn’t what factors in here ...
>>
>> Recently the cluster had a bad drive, and it was replaced (same
>> capacity).
>>
>> 1TB HDDs?  How old is this gear?
>> Oh, looks like your CRUSH weights don’t align with OSD TBs.  Tricky.
>> I suspect your drives are …. 8TB?
>>
>> So the one thing that sticks out straight away is OSD.75 and it having a
>> different weight to all the other devices.
>>
>> That sure doesn’t help.
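
To answer your question above about readjusting the weights: you can do it per
OSD without touching the map by hand, or decompile/edit/compile as you
suggested. A rough sketch only — the target values are just the TiB sizes from
your `ceph osd df` output (roughly 7.27 for the 8T drives, 9.09 for the 10T
ones), and the file names are arbitrary, so double-check everything before
running it:

# per OSD, e.g. osd.1 (10T) and osd.9 (8T) from your output; repeat for the rest
ceph osd crush reweight osd.1 9.09
ceph osd crush reweight osd.9 7.27

# or edit the whole map in one pass
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# set each device's weight in crush.txt to its TiB size, then:
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

Either way, expect a round of backfill once the weights change, so you may want
to wait until the current recovery settles.
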
>> I suspect that for some reason the CRUSH weights of all OSDs in the cluster
>> were set to 1.0000 in the past.  Which in and of itself is … okay, as
>> operationally CRUSH weights are *relative* to each other.  The replaced
>> drive wasn’t brought up with that custom CRUSH weight, so it has the
>> default TiB CRUSH weight.
>>
>> As Frédéric suggests, do this NOW:
>>
>> ceph osd crush reweight osd.75 1.0000
>>
>> This will back off your immediate problem.
>>
>>> ceph osd reweight 75 1
>>
>> Without `crush` in there this would actually be a no-op ;)
>>
>> You could set osd_crush_initial_weight = 1.0 to force all new OSDs to have
>> that 1.000 CRUSH weight, but that would bite you if you do legitimately add
>> larger drives down the road.
>>
>> I suggest reweighting all of your drives to 7.15359 at the same time by
>> decompiling and editing the CRUSH map to avoid future problems.
>>
>> Look at `dmesg` / `/var/log/messages` on each host, `smartctl -a` for each
>> drive
>>
>> For the past week or so the cluster has been recovering, slowly,
>>
>> Look at `dmesg` / `/var/log/messages` on each host, `smartctl -a` for each
>> drive, and `storcli64 /c0 show termlog`.
>>
>> See if there are any indications of one or more bad drives: lots of
>> reallocated sectors, SATA downshifts, etc.
>>
>> and reporting backfill_toofull.  I can't figure out what's causing the
>> issue given there's ample available capacity.
>>
>> Capacity and available capacity are different.
>>
>> Are you using EC?  As wide as 8+2?
>>
>> usage:   197 TiB used, 413 TiB / 610 TiB avail
>>
>>> recovery: 16 MiB/s, 4 objects/s
>>
>> Small clusters recover more slowly, but that’s pretty slow for an 80 OSD
>> cluster.  Is this Reef or Squid with mclock?
>>
>> # ceph osd df
>>
>> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>>
>> Please set your MUA to not wrap
>>
>>  1    hdd  1.00000   1.00000  9.1 TiB  2.2 TiB  2.2 TiB  720 KiB  5.8 GiB  6.9 TiB  24.28  0.75  108      up
>>  9    hdd  1.00000   1.00000  7.3 TiB  2.7 TiB  2.7 TiB   20 MiB  8.8 GiB  4.6 TiB  36.76  1.14  103      up
>> 16    hdd  1.00000   1.00000  7.3 TiB  2.2 TiB  2.2 TiB   63 KiB  6.1 GiB  5.1 TiB  29.82  0.92  109      up
>> 27    hdd  1.00000   1.00000  9.1 TiB  2.4 TiB  2.4 TiB  1.9 MiB  6.5 GiB  6.7 TiB  26.23  0.81  108      up
>> 75    hdd  7.15359   1.00000  7.2 TiB  4.5 TiB  4.5 TiB  158 MiB   13 GiB  2.6 TiB  63.47  1.96  356      up
>> ...
>> TiB  32.01  0.99  105      up
>>
>>                       TOTAL  610 TiB  197 TiB  196 TiB  1.7 GiB  651 GiB  413 TiB  32.31
>>
>> MIN/MAX VAR: 0.67/1.96  STDDEV: 5.72
>>
>> You don’t have a balancer enabled, or it isn’t working.  Your available
>> space is a function not only of the *full ratios but of your replication
>> strategies and is relative to the *most full* OSD.
>>
>> Send `ceph osd crush rule dump` and `ceph balancer status` and `ceph -v`
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
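
One more thing on the rule 0 change, in case it saves you a decompile: a
device-class rule can also be created online and the pools pointed at it.
Sketch only — `replicated_hdd` is just a placeholder name, and <pool> is
whichever pools turn up on crush_rule 0:

ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool ls detail | grep 'crush_rule 0'
ceph osd pool set <pool> crush_rule replicated_hdd

Even though everything in the cluster is hdd, the class-aware rule maps
against the shadow hierarchy, so a bit of data movement when the pools switch
wouldn't surprise me.

And since you're on 18.2.4 with mclock: if recovery still crawls once the
weights are sorted out, bumping the profile is worth a try, and switch it back
when the backfill is done:

ceph config set osd osd_mclock_profile high_recovery_ops
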