> On Feb 26, 2025, at 9:07 AM, Deep Dish <deeepdish@xxxxxxxxx> wrote:
>
> I appreciate all the tips! And thanks for the observation on weights. I
> don't know how it got to 1 for all OSDs. The cluster has a mixture of 8
> and 10T drives. Is there a way to automatically readjust them, or is this
> done manually in the crush map (decompile/edit/compile)?
>
> I ran ceph osd crush reweight 75 1.0 and it started recovering right away,
> 3-4 Gbit/s sustained throughput. I know this is a bandaid; waiting on
> your guidance on how to adjust the weights above.
>
> Here is the requested additional output:
>
> # ceph -v
> ceph version 18.2.4 (..) reef (stable)
>
> NB: Once the cluster is stable and OK status, I plan to upgrade to 19.2.0
> via ceph orch.

19.2.1 is the latest. If you want to go to Squid, go to 19.2.1, or wait for 19.2.2.

> # ceph osd crush rule dump

Do you have any pools using rule 0?  Note that it does not specify a device class, but the others do.  Most likely you do have multiple pools using rule 0, and that’s confusing the balancer.

I would suggest decompiling the CRUSH map, changing rule 0’s take step to `step take default class hdd` (so its item_name becomes "default~hdd" like the other rules), and recompiling it,

OR

creating a new replicated CRUSH rule that specifies the hdd device class and changing your pools to use that instead of rule 0.

I suspect that would unblock your balancer.

> [
>     {
>         "rule_id": 0,
>         "rule_name": "replicated_rule",
>         "type": 1,
>         "steps": [
>             {
>                 "op": "take",
>                 "item": -1,
>                 "item_name": "default"
>             },
>             {
>                 "op": "chooseleaf_firstn",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     },
>     {
>         "rule_id": 1,
>         "rule_name": "fs01_data-ec",
>         "type": 3,
>         "steps": [
>             {
>                 "op": "set_chooseleaf_tries",
>                 "num": 5
>             },
>             {
>                 "op": "set_choose_tries",
>                 "num": 100
>             },
>             {
>                 "op": "take",
>                 "item": -2,
>                 "item_name": "default~hdd"
>             },
>             {
>                 "op": "chooseleaf_indep",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     },
>     {
>         "rule_id": 2,
>         "rule_name": "central.rgw.buckets.data",
>         "type": 3,
>         "steps": [
>             {
>                 "op": "set_chooseleaf_tries",
>                 "num": 5
>             },
>             {
>                 "op": "set_choose_tries",
>                 "num": 100
>             },
>             {
>                 "op": "take",
>                 "item": -2,
>                 "item_name": "default~hdd"
>             },
>             {
>                 "op": "chooseleaf_indep",
>                 "num": 0,
>                 "type": "host"
>             },
>             {
>                 "op": "emit"
>             }
>         ]
>     }
> ]
>
> # ceph balancer status
>
> {
>     "active": true,
>     "last_optimize_duration": "0:00:00.000350",
>     "last_optimize_started": "Wed Feb 26 14:01:03 2025",
>     "mode": "upmap",
>     "no_optimization_needed": true,
>     "optimize_result": "Some objects (0.003469) are degraded; try again later",
>     "plans": []
> }
>
> On Wed, Feb 26, 2025 at 8:18 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
>> On Feb 26, 2025, at 7:47 AM, Deep Dish <deeepdish@xxxxxxxxx> wrote:
>>
>> Your parents had quite the sense of humor.
>>
>> Hello,
>>
>> I have an 80 OSD cluster (across 8 nodes). The average utilization across
>> my OSDs is ~ 32%.
>>
>> Average isn’t what factors in here ...
>>
>> Recently the cluster had a bad drive, and it was replaced (same
>> capacity).
>>
>> 1TB HDDs?  How old is this gear?
>> Oh, looks like your CRUSH weights don’t align with OSD TBs.  Tricky.
>> I suspect your drives are …. 8TB?
>>
>> So the one thing that sticks out straight away is OSD.75 and it having a
>> different weight to all the other devices.
>>
>> That sure doesn’t help.
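
To answer your question above about readjusting the weights: you can do it per
OSD without touching the map by hand, or decompile/edit/compile as you
suggested. A rough sketch only — the target values are just the TiB sizes from
your `ceph osd df` output (roughly 7.27 for the 8T drives, 9.09 for the 10T
ones), and the file names are arbitrary, so double-check everything before
running it:

# per OSD, e.g. osd.1 (10T) and osd.9 (8T) from your output; repeat for the rest
ceph osd crush reweight osd.1 9.09
ceph osd crush reweight osd.9 7.27

# or edit the whole map in one pass
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# set each device's weight in crush.txt to its TiB size, then:
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

Either way, expect a round of backfill once the weights change, so you may want
to wait until the current recovery settles.
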
>> I suspect that for some reason the CRUSH weights of all OSDs in the cluster
>> were set to 1.0000 in the past.  Which in and of itself is … okay, as
>> operationally CRUSH weights are *relative* to each other.  The replaced
>> drive wasn’t brought up with that custom CRUSH weight, so it has the
>> default TiB CRUSH weight.
>>
>> As Frédéric suggests, do this NOW:
>>
>> ceph osd crush reweight osd.75 1.0000
>>
>> This will back off your immediate problem.
>>
>>> ceph osd reweight 75 1
>>
>> Without `crush` in there this would actually be a no-op ;)
>>
>> You could set osd_crush_initial_weight = 1.0 to force all new OSDs to have
>> that 1.000 CRUSH weight, but that would bite you if you do legitimately add
>> larger drives down the road.
>>
>> I suggest reweighting all of your drives to 7.15359 at the same time by
>> decompiling and editing the CRUSH map to avoid future problems.
>>
>> Look at `dmesg` / `/var/log/messages` on each host, `smartctl -a` for each
>> drive
>>
>> For the past week or so the cluster has been recovering, slowly,
>>
>> Look at `dmesg` / `/var/log/messages` on each host, `smartctl -a` for each
>> drive, and `storcli64 /c0 show termlog`.
>>
>> See if there are any indications of one or more bad drives: lots of
>> reallocated sectors, SATA downshifts, etc.
>>
>> and reporting backfill_toofull.  I can't figure out what's causing the
>> issue given there's ample available capacity.
>>
>> Capacity and available capacity are different.
>>
>> Are you using EC?  As wide as 8+2?
>>
>> usage:   197 TiB used, 413 TiB / 610 TiB avail
>>
>>> recovery: 16 MiB/s, 4 objects/s
>>
>> Small clusters recover more slowly, but that’s pretty slow for an 80 OSD
>> cluster.  Is this Reef or Squid with mclock?
>>
>> # ceph osd df
>>
>> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>>
>> Please set your MUA to not wrap
>>
>>  1    hdd  1.00000   1.00000  9.1 TiB  2.2 TiB  2.2 TiB  720 KiB  5.8 GiB  6.9 TiB  24.28  0.75  108      up
>>  9    hdd  1.00000   1.00000  7.3 TiB  2.7 TiB  2.7 TiB   20 MiB  8.8 GiB  4.6 TiB  36.76  1.14  103      up
>> 16    hdd  1.00000   1.00000  7.3 TiB  2.2 TiB  2.2 TiB   63 KiB  6.1 GiB  5.1 TiB  29.82  0.92  109      up
>> 27    hdd  1.00000   1.00000  9.1 TiB  2.4 TiB  2.4 TiB  1.9 MiB  6.5 GiB  6.7 TiB  26.23  0.81  108      up
>> 75    hdd  7.15359   1.00000  7.2 TiB  4.5 TiB  4.5 TiB  158 MiB   13 GiB  2.6 TiB  63.47  1.96  356      up
>> ...
>> TiB  32.01  0.99  105      up
>>
>>                       TOTAL  610 TiB  197 TiB  196 TiB  1.7 GiB  651 GiB  413 TiB  32.31
>>
>> MIN/MAX VAR: 0.67/1.96  STDDEV: 5.72
>>
>> You don’t have a balancer enabled, or it isn’t working.  Your available
>> space is a function not only of the *full ratios but of your replication
>> strategies and is relative to the *most full* OSD.
>>
>> Send `ceph osd crush rule dump` and `ceph balancer status` and `ceph -v`
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
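
One more thing on the rule 0 change, in case it saves you a decompile: a
device-class rule can also be created online and the pools pointed at it.
Sketch only — `replicated_hdd` is just a placeholder name, and <pool> is
whichever pools turn up on crush_rule 0:

ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool ls detail | grep 'crush_rule 0'
ceph osd pool set <pool> crush_rule replicated_hdd

Even though everything in the cluster is hdd, the class-aware rule maps
against the shadow hierarchy, so a bit of data movement when the pools switch
wouldn't surprise me.

And since you're on 18.2.4 with mclock: if recovery still crawls once the
weights are sorted out, bumping the profile is worth a try, and switch it back
when the backfill is done:

ceph config set osd osd_mclock_profile high_recovery_ops
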