I appreciate all the tips! And thanks for the observation on weights. I don't know
how it got to 1 for all OSDs. The cluster has a mixture of 8 and 10T drives. Is
there a way to automatically readjust them, or is this done manually in the CRUSH
map (decompile/edit/compile)? I ran `ceph osd crush reweight osd.75 1.0` and it
started recovering right away at 3-4 Gbit/s sustained throughput. I know this is
a band-aid; waiting on your guidance on how to adjust the weights above.

Here is the requested additional output:

# ceph -v
ceph version 18.2.4 (..) reef (stable)

NB: Once the cluster is stable and back to OK status, I plan to upgrade to 19.2.0
via ceph orch.

# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            { "op": "take", "item": -1, "item_name": "default" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "fs01_data-ec",
        "type": 3,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 5 },
            { "op": "set_choose_tries", "num": 100 },
            { "op": "take", "item": -2, "item_name": "default~hdd" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "central.rgw.buckets.data",
        "type": 3,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 5 },
            { "op": "set_choose_tries", "num": 100 },
            { "op": "take", "item": -2, "item_name": "default~hdd" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }
]

# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.000350",
    "last_optimize_started": "Wed Feb 26 14:01:03 2025",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Some objects (0.003469) are degraded; try again later",
    "plans": []
}

On Wed, Feb 26, 2025 at 8:18 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> On Feb 26, 2025, at 7:47 AM, Deep Dish <deeepdish@xxxxxxxxx> wrote:
>
> Your parents had quite the sense of humor.
>
> Hello,
>
> I have an 80 OSD cluster (across 8 nodes).  The average utilization across
> my OSDs is ~ 32%.
>
> Average isn't what factors in here ...
>
> Recently the cluster had a bad drive, and it was replaced (same
> capacity).
>
> 1TB HDDs?  How old is this gear?
>
> Oh, looks like your CRUSH weights don't align with OSD TBs.  Tricky.  I
> suspect your drives are …. 8TB?
>
> So the one thing that sticks out straight away is OSD.75 and it having a
> different weight to all the other devices.
>
> That sure doesn't help.  I suspect that for some reason the CRUSH weights
> of all OSDs in the cluster were set to 1.0000 in the past.  Which in and of
> itself is … okay, as operationally CRUSH weights are *relative* to each
> other.  The replaced drive wasn't brought up with that custom CRUSH weight,
> so it has the default TiB CRUSH weight.
>
> As Frédéric suggests, do this NOW:
>
>     ceph osd crush reweight osd.75 1.0000
>
> This will back off your immediate problem.
>
> > ceph osd reweight 75 1
>
> Without `crush` in there this would actually be a no-op ;)
>
> You could set osd_crush_initial_weight = 1.0 to force all new OSDs to have
> that 1.0000 CRUSH weight, but that would bite you if you do legitimately
> add larger drives down the road.
>
> I suggest reweighting all of your drives to 7.15359 at the same time by
> decompiling and editing the CRUSH map to avoid future problems.
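
If I understand the two options correctly, it would be roughly one of these (the osd
IDs and the ~9.09 figure for the 10T drives below are just my guesses from the SIZE
column in `ceph osd df`; 7.15359 is the value osd.75 came up with -- please correct
me if I'm off):

    # Option 1: per-OSD, no decompile needed; repeat for every OSD with the
    # CRUSH weight that matches its raw capacity in TiB:
    ceph osd crush reweight osd.9 7.15359     # 8T drive (example ID)
    ceph osd crush reweight osd.1 9.09        # 10T drive (example ID, approximate weight)

    # Option 2: edit the whole map in one pass:
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    #   ... edit the "item osd.N weight X.XXXXX" lines under each host bucket ...
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

I assume option 2 is preferable, since injecting the edited map applies all the weight
changes in a single remap rather than kicking off a new round of peering per command?
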
>
> For the past week or so the cluster has been
> recovering, slowly,
>
> Look at `dmesg` / `/var/log/messages` on each host, `smartctl -a` for each
> drive, and `storcli64 /c0 show termlog`.
>
> See if there are any indications of one or more bad drives: lots of
> reallocated sectors, SATA downshifts, etc.
>
> and reporting backfill_toofull.  I can't figure out what's causing the
> issue given there's ample available capacity.
>
> Capacity and available capacity are different.
>
> Are you using EC?  As wide as 8+2?
>
> usage:   197 TiB used, 413 TiB / 610 TiB avail
>
> recovery: 16 MiB/s, 4 objects/s
>
> Small clusters recover more slowly, but that's pretty slow for an 80 OSD
> cluster.  Is this Reef or Squid with mclock?
>
> # ceph osd df
> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>
> Please set your MUA to not wrap
>
>  1    hdd  1.00000   1.00000  9.1 TiB  2.2 TiB  2.2 TiB  720 KiB  5.8 GiB  6.9 TiB  24.28  0.75  108      up
>  9    hdd  1.00000   1.00000  7.3 TiB  2.7 TiB  2.7 TiB   20 MiB  8.8 GiB  4.6 TiB  36.76  1.14  103      up
> 16    hdd  1.00000   1.00000  7.3 TiB  2.2 TiB  2.2 TiB   63 KiB  6.1 GiB  5.1 TiB  29.82  0.92  109      up
> 27    hdd  1.00000   1.00000  9.1 TiB  2.4 TiB  2.4 TiB  1.9 MiB  6.5 GiB  6.7 TiB  26.23  0.81  108      up
> 75    hdd  7.15359   1.00000  7.2 TiB  4.5 TiB  4.5 TiB  158 MiB   13 GiB  2.6 TiB  63.47  1.96  356      up
> ...                                                               TiB      32.01  0.99  105      up
>                               TOTAL    610 TiB  197 TiB  196 TiB  1.7 GiB  651 GiB  413 TiB  32.31
> MIN/MAX VAR: 0.67/1.96  STDDEV: 5.72
>
> You don't have a balancer enabled, or it isn't working.  Your available
> space is a function not only of the *full ratios but of your replication
> strategies and is relative to the *most full* OSD.
>
> Send `ceph osd crush rule dump` and `ceph balancer status` and `ceph -v`
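
Also, is something like this the per-host drive check you have in mind? (Device
names and the grep patterns are just my sketch; I'd add `storcli64 /c0 show termlog`
only on the hosts that have a Broadcom/LSI HBA.)

    # recent kernel-level disk errors
    dmesg -T | grep -iE 'blk_update_request|I/O error|medium error'

    # SMART counters that typically flag a failing HDD
    for d in /dev/sd[a-z] /dev/sd[a-z][a-z]; do
        [ -b "$d" ] || continue
        echo "== $d =="
        smartctl -a "$d" | grep -iE 'reallocated|pending|uncorrect|crc'
    done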