Re: backfill_toofull not clearing on Reef

> On Feb 26, 2025, at 9:07 AM, Deep Dish <deeepdish@xxxxxxxxx> wrote:
> 
> I appreciate all the tips!  And thanks for the observation on weights.  I don't know how it got to 1 for all OSDs.  The cluster has a mixture of 8 and 10T drives.  Is there a way to automatically readjust them, or is this done manually in the CRUSH map (decompile/edit/compile)?

You can do them individually with 

		ceph osd crush reweight osd.<id> <weight>

I didn’t suggest that mainly because, when some OSDs are already marginal with respect to fullness, issuing 80 of those commands one at a time is kind of a pain, and even run back to back that’s 80 incremental changes; I don’t know whether the cluster batches the resulting map updates, maybe it does.  More to the point, when any OSDs are close to full, increasing their CRUSH weights before the others *might* result in them getting too much data and going full.

As the other OSDs are reweighted, the PG mappings will change over and over.  If you go that route, I’d put all the commands in a script and run the script so they execute in quick succession (rough sketch below).  Either way, as long as you tell the cluster that some OSDs are 8x the size of others, you get what you have now, which is why there’s a backfillfull guardrail.

Your 8T drives should get 7.15359, assuming they’re all the same SKU; even if they’re different SKUs they’re probably close in size.  Note that Ceph speaks TiB (base 2) here, while drive manufacturers rate capacity in decimal TB so they can claim a higher number.  Weasels!

Your 10T drives … probably a value like 9.09495.  If you want to be exact, I’d suggest setting them to that value, waiting for the dust to settle, then undeploying and redeploying one of them; it’ll come back with the exact CRUSH weight, which you can then retrofit to the others.  I suspect it won’t differ from 9.09495 by more than +/- 0.5.
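
Rough sketch of the script approach (untested; the per-size ID lists below are just examples pulled from the `ceph osd df` output you posted, fill in all 80 of your OSDs):

	#!/bin/bash
	# Example only: put your actual OSD IDs in the appropriate list.
	EIGHT_TB_OSDS="9 16 75"
	TEN_TB_OSDS="1 27"
	for id in $EIGHT_TB_OSDS; do
	    ceph osd crush reweight osd.$id 7.15359
	done
	for id in $TEN_TB_OSDS; do
	    ceph osd crush reweight osd.$id 9.09495
	done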


> 
> I ran ceph osd crush reweight 75 1.0 and it started recovering right away, 3-4 Gbit/s sustained throughput.  I know this is a bandaid; waiting on your guidance on how to adjust the weights as above.
> 
> Here is the requested additional output:
> 
> # ceph -v
> ceph version 18.2.4 (..) reef (stable)
> 
> NB:  Once the cluster is stable and in OK status, I plan to upgrade to 19.2.0 via ceph orch.
> 
> # ceph osd crush rule dump
> [
>     {
>         "rule_id": 0,
>         "rule_name": "replicated_rule",
>         "type": 1,
>         "steps": [
>             {
>                 "op": "take",
>                 "item": -1,
>                 "item_name": "default"
>             },
>             {
>                 "op": "chooseleaf_firstn",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     },
>     {
>         "rule_id": 1,
>         "rule_name": "fs01_data-ec",
>         "type": 3,
>         "steps": [
>             {
>                 "op": "set_chooseleaf_tries",
>                 "num": 5
>             },
>             {
>                 "op": "set_choose_tries",
>                 "num": 100
>             },
>             {
>                 "op": "take",
>                 "item": -2,
>                 "item_name": "default~hdd"
>             },
>             {
>                 "op": "chooseleaf_indep",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     },
>     {
>         "rule_id": 2,
>         "rule_name": "central.rgw.buckets.data",
>         "type": 3,
>         "steps": [
>             {
>                 "op": "set_chooseleaf_tries",
>                 "num": 5
>             },
>             {
>                 "op": "set_choose_tries",
>                 "num": 100
>             },
>             {
>                 "op": "take",
>                 "item": -2,
>                 "item_name": "default~hdd"
>             },
>             {
>                 "op": "chooseleaf_indep",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     }
> ]
> 
> # ceph balancer status
> {
>     "active": true,
>     "last_optimize_duration": "0:00:00.000350",
>     "last_optimize_started": "Wed Feb 26 14:01:03 2025",
>     "mode": "upmap",
>     "no_optimization_needed": true,
>     "optimize_result": "Some objects (0.003469) are degraded; try again later",
>     "plans": []
> }
> 
> 
> On Wed, Feb 26, 2025 at 8:18 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx <mailto:aad@xxxxxxxxxxxxxx>> wrote:
>> 
>> On Feb 26, 2025, at 7:47 AM, Deep Dish <deeepdish@xxxxxxxxx <mailto:deeepdish@xxxxxxxxx>> wrote:
>> 
>> Your parents had quite the sense of humor.
>> 
>>> Hello,
>>> 
>>> I have an 80 OSD cluster (across 8 nodes).  The average utilization across my OSDs is ~ 32%.
>> 
>> Average isn’t what factors in here ...
>> 
>>>   Recently the cluster had a bad drive, and it was replaced (same capacity).
>> 
>> 1TB HDDs? How old is this gear? 
>> Oh, looks like your CRUSH weights don’t align with OSD TBs.  Tricky.  I suspect your drives are …. 8TB?
>> 
>>> So the one thing that sticks out straight away is OSD.75 and it having a different weight to all the other devices.
>> 
>> That sure doesn’t help.  I suspect that for some reason the CRUSH weights of all OSDs in the cluster were set to 1.0000 in the past.  Which in and of itself is … okay, as operationally CRUSH weights are *relative* to each other.  The replaced drive wasn’t brought up with that custom 1.0000 weight, so it came back with the default CRUSH weight, i.e. its size in TiB.
>> 
>> As Frédéric suggests, do this NOW:
>> 
>> 	ceph osd crush reweight osd.75 1.0000
>> 
>> This will back off your immediate problem.
>> 
>> 
>> >ceph osd reweight 75 1
>> 
>> Without `crush` in there this would actually be a no-op ;)
>> 
>> You could set osd_crush_initial_weight = 1.0 to force all new OSDs to have that 1.000 CRUSH weight, but that would bite you if you do legitimately add larger drives down the road.
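>> 
>> For reference, that knob can be set cluster-wide with the usual config syntax, e.g.:
>> 
>> 	ceph config set osd osd_crush_initial_weight 1.0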
>> 
>> I suggest reweighting all of your drives to 7.15359 at the same time by decompiling and editing the CRUSH map to avoid future problems.
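>> 
>> The CRUSH map round trip is the usual getcrushmap / crushtool dance; a minimal sketch (filenames are placeholders):
>> 
>> 	ceph osd getcrushmap -o crushmap.bin
>> 	crushtool -d crushmap.bin -o crushmap.txt
>> 	# edit the "item osd.N weight ..." entries in crushmap.txt, then:
>> 	crushtool -c crushmap.txt -o crushmap-new.bin
>> 	ceph osd setcrushmap -i crushmap-new.bin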
>> 
>>>  For the past week or so the cluster has been
>>> recovering, slowly,
>> 
>> Look at `dmesg` / `/var/log/messages` on each host, `smartctl -a` for each drive, and `storcli64 /c0 show termlog`.
>> 
>> See if there are any indications of one or more bad drives:  lots of reallocated sectors, SATA downshifts, etc.
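>> 
>> Something along these lines will surface the usual suspects (sketch only; adjust the device glob for your controllers):
>> 
>> 	for dev in /dev/sd?; do
>> 	    echo "== $dev =="
>> 	    smartctl -a "$dev" | grep -Ei 'Reallocated_Sector|Current_Pending|Offline_Uncorrect|UDMA_CRC_Error'
>> 	done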
>> 
>>> and reporting backfill_toofull.   I can't figure out what's causing the issue given there's ample available capacity.
>> 
>> Capacity and available capacity are different.
>> 
>> Are you using EC?  As wide as 8+2?
>> 
>>>    usage:   197 TiB used, 413 TiB / 610 TiB avail
>> 
>>>    recovery: 16 MiB/s, 4 objects/s
>> 
>> Small clusters recover more slowly, but that’s pretty slow for an 80 OSD cluster.  Is this Reef or Squid with mclock?
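>> 
>> If it is Reef with mclock (the default scheduler there), bumping the recovery profile while you dig out can help, e.g.:
>> 
>> 	ceph config set osd osd_mclock_profile high_recovery_ops
>> 	# and back to the balanced profile once recovery settles:
>> 	ceph config set osd osd_mclock_profile balanced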
>> 
>> 
>>> 
>>> 
>>> # ceph osd df
>>> 
>>> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>> 
>> Please set your MUA to not wrap 
>> 
>>> 
>>> 1    hdd  1.00000   1.00000  9.1 TiB  2.2 TiB  2.2 TiB  720 KiB  5.8 GiB  6.9 TiB  24.28  0.75  108      up
>>> 
>>> 9    hdd  1.00000   1.00000  7.3 TiB  2.7 TiB  2.7 TiB   20 MiB  8.8 GiB  4.6 TiB  36.76  1.14  103      up
>>> 
>>> 16    hdd  1.00000   1.00000  7.3 TiB  2.2 TiB  2.2 TiB   63 KiB  6.1 GiB  5.1 TiB  29.82  0.92  109      up
>>> 
>>> 27    hdd  1.00000   1.00000  9.1 TiB  2.4 TiB  2.4 TiB  1.9 MiB  6.5 GiB  6.7 TiB  26.23  0.81  108      up
>>> 75    hdd  7.15359   1.00000  7.2 TiB  4.5 TiB  4.5 TiB  158 MiB   13 GiB  2.6 TiB  63.47  1.96  356      up
>>> ...
>>> TiB  32.01  0.99  105      up
>>> 
>>>                       TOTAL  610 TiB  197 TiB  196 TiB  1.7 GiB  651 GiB  413 TiB  32.31
>>> 
>>> MIN/MAX VAR: 0.67/1.96  STDDEV: 5.72
>> 
>> You don’t have a balancer enabled, or it isn’t working.  Your available space is a function not only of the *full ratios but of your replication strategies and is relative to the *most full* OSD.
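>> 
>> If it turns out the balancer is simply off, the upmap balancer is normally just:
>> 
>> 	ceph balancer mode upmap
>> 	ceph balancer on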
>> 
>> Send `ceph osd crush rule dump` and `ceph balancer status` and `ceph -v`
>> 
>>> 
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



