I appreciate all the tips! And thanks for the observation on weights. I don't know
how it got to 1 for all OSDs. The cluster has a mixture of 8 and 10T drives. Is
there a way to automatically readjust them, or is this done manually in the CRUSH
map (decompile/edit/compile)? I ran `ceph osd crush reweight osd.75 1.0` and it
started recovering right away at 3-4 Gbit/s sustained throughput. I know this is
a band-aid; waiting on your guidance on how to adjust the weights above.

Here is the requested additional output:

# ceph -v
ceph version 18.2.4 (..) reef (stable)

NB: Once the cluster is stable and back to OK status, I plan to upgrade to 19.2.0
via ceph orch.

# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            { "op": "take", "item": -1, "item_name": "default" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "fs01_data-ec",
        "type": 3,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 5 },
            { "op": "set_choose_tries", "num": 100 },
            { "op": "take", "item": -2, "item_name": "default~hdd" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "central.rgw.buckets.data",
        "type": 3,
        "steps": [
            { "op": "set_chooseleaf_tries", "num": 5 },
            { "op": "set_choose_tries", "num": 100 },
            { "op": "take", "item": -2, "item_name": "default~hdd" },
            { "op": "chooseleaf_indep", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }
]

# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.000350",
    "last_optimize_started": "Wed Feb 26 14:01:03 2025",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Some objects (0.003469) are degraded; try again later",
    "plans": []
}

On Wed, Feb 26, 2025 at 8:18 AM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> On Feb 26, 2025, at 7:47 AM, Deep Dish <deeepdish@xxxxxxxxx> wrote:
>
> Your parents had quite the sense of humor.
>
> Hello,
>
> I have an 80 OSD cluster (across 8 nodes).  The average utilization across
> my OSDs is ~ 32%.
>
> Average isn't what factors in here ...
>
> Recently the cluster had a bad drive, and it was replaced (same
> capacity).
>
> 1TB HDDs?  How old is this gear?
>
> Oh, looks like your CRUSH weights don't align with OSD TBs.  Tricky.  I
> suspect your drives are …. 8TB?
>
> So the one thing that sticks out straight away is OSD.75 and it having a
> different weight to all the other devices.
>
> That sure doesn't help.  I suspect that for some reason the CRUSH weights
> of all OSDs in the cluster were set to 1.0000 in the past.  Which in and of
> itself is … okay, as operationally CRUSH weights are *relative* to each
> other.  The replaced drive wasn't brought up with that custom CRUSH weight,
> so it has the default TiB CRUSH weight.
>
> As Frédéric suggests, do this NOW:
>
>     ceph osd crush reweight osd.75 1.0000
>
> This will back off your immediate problem.
>
> > ceph osd reweight 75 1
>
> Without `crush` in there this would actually be a no-op ;)
>
> You could set osd_crush_initial_weight = 1.0 to force all new OSDs to have
> that 1.0000 CRUSH weight, but that would bite you if you do legitimately
> add larger drives down the road.
>
> I suggest reweighting all of your drives to 7.15359 at the same time by
> decompiling and editing the CRUSH map to avoid future problems.
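
If I understand the two options correctly, it would be roughly one of these (the osd
IDs and the ~9.09 figure for the 10T drives below are just my guesses from the SIZE
column in `ceph osd df`; 7.15359 is the value osd.75 came up with -- please correct
me if I'm off):

    # Option 1: per-OSD, no decompile needed; repeat for every OSD with the
    # CRUSH weight that matches its raw capacity in TiB:
    ceph osd crush reweight osd.9 7.15359     # 8T drive (example ID)
    ceph osd crush reweight osd.1 9.09        # 10T drive (example ID, approximate weight)

    # Option 2: edit the whole map in one pass:
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    #   ... edit the "item osd.N weight X.XXXXX" lines under each host bucket ...
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

I assume option 2 is preferable, since injecting the edited map applies all the weight
changes in a single remap rather than kicking off a new round of peering per command?
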
>
> For the past week or so the cluster has been
> recovering, slowly,
>
> Look at `dmesg` / `/var/log/messages` on each host, `smartctl -a` for each
> drive, and `storcli64 /c0 show termlog`.
>
> See if there are any indications of one or more bad drives: lots of
> reallocated sectors, SATA downshifts, etc.
>
> and reporting backfill_toofull.  I can't figure out what's causing the
> issue given there's ample available capacity.
>
> Capacity and available capacity are different.
>
> Are you using EC?  As wide as 8+2?
>
> usage:   197 TiB used, 413 TiB / 610 TiB avail
>
> recovery: 16 MiB/s, 4 objects/s
>
> Small clusters recover more slowly, but that's pretty slow for an 80 OSD
> cluster.  Is this Reef or Squid with mclock?
>
> # ceph osd df
> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
>
> Please set your MUA to not wrap
>
>  1    hdd  1.00000   1.00000  9.1 TiB  2.2 TiB  2.2 TiB  720 KiB  5.8 GiB  6.9 TiB  24.28  0.75  108      up
>  9    hdd  1.00000   1.00000  7.3 TiB  2.7 TiB  2.7 TiB   20 MiB  8.8 GiB  4.6 TiB  36.76  1.14  103      up
> 16    hdd  1.00000   1.00000  7.3 TiB  2.2 TiB  2.2 TiB   63 KiB  6.1 GiB  5.1 TiB  29.82  0.92  109      up
> 27    hdd  1.00000   1.00000  9.1 TiB  2.4 TiB  2.4 TiB  1.9 MiB  6.5 GiB  6.7 TiB  26.23  0.81  108      up
> 75    hdd  7.15359   1.00000  7.2 TiB  4.5 TiB  4.5 TiB  158 MiB   13 GiB  2.6 TiB  63.47  1.96  356      up
> ...                                                               TiB      32.01  0.99  105      up
>                               TOTAL    610 TiB  197 TiB  196 TiB  1.7 GiB  651 GiB  413 TiB  32.31
> MIN/MAX VAR: 0.67/1.96  STDDEV: 5.72
>
> You don't have a balancer enabled, or it isn't working.  Your available
> space is a function not only of the *full ratios but of your replication
> strategies and is relative to the *most full* OSD.
>
> Send `ceph osd crush rule dump` and `ceph balancer status` and `ceph -v`
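
Also, is something like this the per-host drive check you have in mind? (Device
names and the grep patterns are just my sketch; I'd add `storcli64 /c0 show termlog`
only on the hosts that have a Broadcom/LSI HBA.)

    # recent kernel-level disk errors
    dmesg -T | grep -iE 'blk_update_request|I/O error|medium error'

    # SMART counters that typically flag a failing HDD
    for d in /dev/sd[a-z] /dev/sd[a-z][a-z]; do
        [ -b "$d" ] || continue
        echo "== $d =="
        smartctl -a "$d" | grep -iE 'reallocated|pending|uncorrect|crc'
    done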