> On Feb 26, 2025, at 9:07 AM, Deep Dish <deeepdish@xxxxxxxxx> wrote:
>
> I appreciate all the tips! And thanks for the observation on weights. I don't know how it got to 1 for all OSDs. The cluster has a mixture of 8 and 10T drives. Is there a way to automatically readjust them, or is this done manually in the CRUSH map (decompile/edit/compile)?

You can do them individually with ceph osd crush reweight.

I didn't suggest that mainly because, when you have any OSDs that are currently marginal with respect to fullness, issuing 80 of those commands serially is kind of a pain. Even run back to back, that's 80 incremental changes; I don't know whether the cluster batches the map updates it sends out, maybe it does. But when any OSDs are close to full, increasing their CRUSH weights before the others *might* result in them getting too much data and going full. As the other OSDs are reweighted, the PG mappings will change over and over. If you prefer, put all the commands in a script and run that, so they're executed in quick succession; a rough sketch follows below.

Otherwise, if you tell the cluster that some OSDs are 8x the size of others, you get what you have now, which is why there's a backfillfull guardrail.

Your 8T drives should get 7.15359, assuming they're all the same SKU; even if they are different SKUs, they're probably close in size. Note that Ceph speaks TiB (base 2) here, while drive manufacturers rate capacity in decimal TB so they can claim a higher number. Weasels! Your 10T drives should probably get a value like 9.09495. If you want to be exact, I would suggest setting them to that value, waiting for the dust to settle, then undeploying and redeploying one of them; it'll come back with the exact CRUSH weight, which you can then retrofit to the others. I suspect it won't differ from 9.09495 by more than +/- 0.5.

> I ran ceph osd crush reweight 75 1.0 and it started recovering right away, 3-4 Gbit/s sustained throughput. I know this is a bandaid; waiting on your guidance on how to adjust the weights above.
>
> Here is the requested additional output:
>
> # ceph -v
> ceph version 18.2.4 (..) reef (stable)
>
> NB: Once the cluster is stable and OK status, I plan to upgrade to 19.2.0 via ceph orch.
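Something along these lines for the script mentioned above, as a sketch only: the EIGHT_T list is just illustrative (judging by the ~7.2-7.3 TiB OSDs in your `ceph osd df` paste), so fill in the real IDs for all 80 OSDs and sanity-check it before running.

#!/bin/sh
# Sketch: bulk-set CRUSH weights so 8T OSDs get ~7.15359 and 10T OSDs get ~9.09495.
# EIGHT_T is a placeholder list of the 8T OSD IDs; adjust to your actual hardware.
EIGHT_T="9 16 75"
for id in $(ceph osd ls); do
    if echo " $EIGHT_T " | grep -q " $id "; then
        ceph osd crush reweight osd.$id 7.15359
    else
        ceph osd crush reweight osd.$id 9.09495
    fi
done

Running it issues the 80 reweights in quick succession, which is the point of scripting them rather than typing them one at a time.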
>
> # ceph osd crush rule dump
> [
>     {
>         "rule_id": 0,
>         "rule_name": "replicated_rule",
>         "type": 1,
>         "steps": [
>             {
>                 "op": "take",
>                 "item": -1,
>                 "item_name": "default"
>             },
>             {
>                 "op": "chooseleaf_firstn",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     },
>     {
>         "rule_id": 1,
>         "rule_name": "fs01_data-ec",
>         "type": 3,
>         "steps": [
>             {
>                 "op": "set_chooseleaf_tries",
>                 "num": 5
>             },
>             {
>                 "op": "set_choose_tries",
>                 "num": 100
>             },
>             {
>                 "op": "take",
>                 "item": -2,
>                 "item_name": "default~hdd"
>             },
>             {
>                 "op": "chooseleaf_indep",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     },
>     {
>         "rule_id": 2,
>         "rule_name": "central.rgw.buckets.data",
>         "type": 3,
>         "steps": [
>             {
>                 "op": "set_chooseleaf_tries",
>                 "num": 5
>             },
>             {
>                 "op": "set_choose_tries",
>                 "num": 100
>             },
>             {
>                 "op": "take",
>                 "item": -2,
>                 "item_name": "default~hdd"
>             },
>             {
>                 "op": "chooseleaf_indep",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     }
> ]
>
> # ceph balancer status
> {
>     "active": true,
>     "last_optimize_duration": "0:00:00.000350",
>     "last_optimize_started": "Wed Feb 26 14:01:03 2025",
>     "mode": "upmap",
>     "no_optimization_needed": true,
>     "optimize_result": "Some objects (0.003469) are degraded; try again later",
>     "plans": []
> }
>
>
> On Wed, Feb 26, 2025 at 8:18 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>>
>> On Feb 26, 2025, at 7:47 AM, Deep Dish <deeepdish@xxxxxxxxx> wrote:
>>
>> Your parents had quite the sense of humor.
>>
>>> Hello,
>>>
>>> I have an 80 OSD cluster (across 8 nodes). The average utilization across my OSDs is ~ 32%.
>>
>> Average isn't what factors in here ...
>>
>>> Recently the cluster had a bad drive, and it was replaced (same capacity).
>>
>> 1TB HDDs? How old is this gear?
>> Oh, looks like your CRUSH weights don't align with OSD TBs. Tricky. I suspect your drives are ... 8TB?
>>
>>> So the one thing that sticks out straight away is OSD.75, which has a different weight to all the other devices.
>>
>> That sure doesn't help. I suspect that for some reason the CRUSH weights of all OSDs in the cluster were set to 1.0000 in the past. Which in and of itself is ... okay, as operationally CRUSH weights are *relative* to each other. The replaced drive wasn't brought up with that custom CRUSH weight, so it has the default TiB CRUSH weight.
>>
>> As Frédéric suggests, do this NOW:
>>
>> ceph osd crush reweight osd.75 1.0000
>>
>> This will back off your immediate problem.
>>
>> > ceph osd reweight 75 1
>>
>> Without `crush` in there this would actually be a no-op ;)
>>
>> You could set osd_crush_initial_weight = 1.0 to force all new OSDs to have that 1.0000 CRUSH weight, but that would bite you if you do legitimately add larger drives down the road.
>>
>> I suggest reweighting all of your drives to 7.15359 at the same time, by decompiling and editing the CRUSH map, to avoid future problems.
>>
>>> For the past week or so the cluster has been
>>> recovering, slowly,
>>
>> Look at `dmesg` / `/var/log/messages` on each host, `smartctl -a` for each drive, and `storcli64 /c0 show termlog`.
>>
>> See if there are any indications of one or more bad drives: lots of reallocated sectors, SATA downshifts, etc.
>>
>>> and reporting backfill_toofull. I can't figure out what's causing the issue given there's ample available capacity.
>>
>> Capacity and available capacity are different.
>>
>> Are you using EC? As wide as 8+2?
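Since the decompile/edit/compile route came up above, the usual round trip looks roughly like this. Treat it as a sketch: crushmap.bin / crushmap.txt / crushmap.new are placeholder filenames, and it's worth reviewing the decompiled text (and keeping the original binary) before injecting anything back.

ceph osd getcrushmap -o crushmap.bin         # grab the current CRUSH map (binary)
crushtool -d crushmap.bin -o crushmap.txt    # decompile it to editable text

# Edit the "item osd.N weight ..." entries in the host buckets of crushmap.txt,
# e.g. 7.15359 for the 8T drives and 9.09495 for the 10T drives.

crushtool -c crushmap.txt -o crushmap.new    # recompile
ceph osd setcrushmap -i crushmap.new         # inject it back

The advantage over per-OSD reweights is the one noted above: all of the weight changes land together rather than as 80 separate adjustments.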
>>
>>> usage:   197 TiB used, 413 TiB / 610 TiB avail
>>
>> > recovery: 16 MiB/s, 4 objects/s
>>
>> Small clusters recover more slowly, but that's pretty slow for an 80 OSD cluster. Is this Reef or Squid with mclock?
>>
>>>
>>> # ceph osd df
>>>
>>> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META
>>> AVAIL    %USE   VAR   PGS  STATUS
>>
>> Please set your MUA to not wrap
>>
>>>  1   hdd  1.00000  1.00000  9.1 TiB  2.2 TiB  2.2 TiB  720 KiB  5.8 GiB  6.9 TiB  24.28  0.75  108  up
>>>
>>>  9   hdd  1.00000  1.00000  7.3 TiB  2.7 TiB  2.7 TiB   20 MiB  8.8 GiB  4.6 TiB  36.76  1.14  103  up
>>>
>>> 16   hdd  1.00000  1.00000  7.3 TiB  2.2 TiB  2.2 TiB   63 KiB  6.1 GiB  5.1 TiB  29.82  0.92  109  up
>>>
>>> 27   hdd  1.00000  1.00000  9.1 TiB  2.4 TiB  2.4 TiB  1.9 MiB  6.5 GiB  6.7 TiB  26.23  0.81  108  up
>>> 75   hdd  7.15359  1.00000  7.2 TiB  4.5 TiB  4.5 TiB  158 MiB   13 GiB  2.6 TiB  63.47  1.96  356  up
>>> ...
>>> TiB  32.01  0.99  105  up
>>>
>>> TOTAL  610 TiB  197 TiB  196 TiB  1.7 GiB  651 GiB  413
>>> TiB  32.31
>>>
>>> MIN/MAX VAR: 0.67/1.96  STDDEV: 5.72
>>
>> You don't have a balancer enabled, or it isn't working. Your available space is a function not only of the *full ratios but also of your replication strategies, and it is relative to the *most full* OSD.
>>
>> Send `ceph osd crush rule dump` and `ceph balancer status` and `ceph -v`
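Since usable space hinges on the most-full OSD and the *full ratios, two quick checks along those lines; these assume the default `ceph osd df` column layout shown above, so adjust the sort column if your output differs.

ceph osd dump | grep -i ratio         # the nearfull / backfillfull / full guardrails in effect
ceph osd df | sort -rn -k17 | head    # most-full OSDs first; column 17 is %USE in the default layout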