Re: backfill_toofull not clearing on Reef

On Feb 26, 2025, at 7:47 AM, Deep Dish <deeepdish@xxxxxxxxx> wrote:

Your parents had quite the sense of humor.

> Hello,
> 
> I have an 80 OSD cluster (across 8 nodes).  The average utilization across my OSDs is ~ 32%.

Average isn’t what factors in here ...

>   Recently the cluster had a bad drive, and it was replaced (same capacity).

1TB HDDs? How old is this gear? 
Oh, looks like your CRUSH weights don’t align with your OSDs’ sizes in TiB.  Tricky.  I suspect your drives are … 8 TB?

> So the one thing that sticks out straight away is OSD.75 and it having a different weight to all the other devices.

That sure doesn’t help.  I suspect that for some reason the CRUSH weights of all OSDs in the cluster were set to 1.0000 in the past.  Which in and of itself is … okay, as operationally CRUSH weights are *relative* to each other.  The replaced drive wasn’t brought up with that custom CRUSH weight, so it came up with the default CRUSH weight, which is the drive’s size in TiB.
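
You can see the mismatch at a glance with something like:

	ceph osd df tree     # CRUSH weight vs. size vs. %USE per OSD, in tree form
	ceph osd crush tree  # just the weights as CRUSH sees them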

As Frédéric suggests, do this NOW:

	ceph osd crush reweight osd.75 1.0000

This will back off your immediate problem.


>ceph osd reweight 75 1

Without `crush` in there this would actually be a no-op ;)  `ceph osd reweight` sets the override reweight (the REWEIGHT column), which is already 1.00000 on osd.75; it doesn’t touch the CRUSH weight.

You could set osd_crush_initial_weight = 1.0 to force all new OSDs to have that 1.000 CRUSH weight, but that would bite you if you do legitimately add larger drives down the road.
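
If you do go that route, something like this would do it (it only affects OSDs created after it’s set):

	# central config store, cluster-wide
	ceph config set osd osd_crush_initial_weight 1.0

	# or the ceph.conf equivalent, under [osd]
	osd_crush_initial_weight = 1.0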

I suggest reweighting all of your drives to 7.15359 at the same time by decompiling and editing the CRUSH map to avoid future problems.
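
Roughly, and untested here; set the norebalance flag first so data doesn’t start sloshing around until every weight is in place:

	ceph osd set norebalance

	ceph osd getcrushmap -o crushmap.bin
	crushtool -d crushmap.bin -o crushmap.txt
	# edit the "item osd.N weight ..." lines in crushmap.txt
	crushtool -c crushmap.txt -o crushmap.new
	ceph osd setcrushmap -i crushmap.new

	ceph osd unset norebalance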

>  For the past week or so the cluster has been
> recovering, slowly,

Look at `dmesg` / `/var/log/messages` on each host, `smartctl -a` for each drive, and `storcli64 /c0 show termlog`.

See if there are any indications of one or more bad drives:  lots of reallocated sectors, SATA downshifts, etc.
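
A quick pass on each host, something like this (adjust device paths to taste; if the drives sit behind a MegaRAID HBA, which the storcli64 above suggests, you may need smartctl’s `-d megaraid,N` syntax instead):

	for d in /dev/sd?; do
	    echo "== $d"
	    smartctl -a "$d" | egrep -i 'realloc|pending|uncorrect|crc'
	done

	dmesg -T | egrep -i 'ata[0-9]|i/o error|reset' | tail -50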

> and reporting backfill_toofull.   I can't figure out what's causing the issue given there's ample available capacity.

Capacity and available capacity are different things.  backfill_toofull keys off individual OSDs crossing the backfillfull ratio, not the cluster-wide average.

Are you using EC?  As wide as 8+2?

>    usage:   197 TiB used, 413 TiB / 610 TiB avail

>   recovery: 16 MiB/s, 4 objects/s

Small clusters recover more slowly, but that’s pretty slow for an 80 OSD cluster.  Is this Reef or Squid with mclock?
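
If it’s mclock, check which profile you’re on; the high_recovery_ops profile trades client I/O headroom for faster recovery:

	ceph config get osd osd_mclock_profile
	ceph config set osd osd_mclock_profile high_recovery_ops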


> 
> 
> # ceph osd df
> 
> ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS

Please set your MUA to not wrap lines.

> 
> 1    hdd  1.00000   1.00000  9.1 TiB  2.2 TiB  2.2 TiB  720 KiB  5.8 GiB  6.9 TiB  24.28  0.75  108      up
> 
> 9    hdd  1.00000   1.00000  7.3 TiB  2.7 TiB  2.7 TiB   20 MiB  8.8 GiB  4.6 TiB  36.76  1.14  103      up
> 
> 16    hdd  1.00000   1.00000  7.3 TiB  2.2 TiB  2.2 TiB   63 KiB  6.1 GiB  5.1 TiB  29.82  0.92  109      up
> 
> 27    hdd  1.00000   1.00000  9.1 TiB  2.4 TiB  2.4 TiB  1.9 MiB  6.5 GiB  6.7 TiB  26.23  0.81  108      up
> 75    hdd  7.15359   1.00000  7.2 TiB  4.5 TiB  4.5 TiB  158 MiB   13 GiB  2.6 TiB  63.47  1.96  356      up
> ...
> TiB  32.01  0.99  105      up
> 
>                       TOTAL  610 TiB  197 TiB  196 TiB  1.7 GiB  651 GiB  413 TiB  32.31
> 
> MIN/MAX VAR: 0.67/1.96  STDDEV: 5.72

You don’t have a balancer enabled, or it isn’t working.  Your available space is a function not only of the full / backfillfull / nearfull ratios but of your replication strategies, and is relative to the *most full* OSD.
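
If it simply isn’t enabled, upmap mode is usually the right choice; something like:

	ceph balancer status
	# upmap needs require-min-compat-client luminous or newer
	ceph osd set-require-min-compat-client luminous
	ceph balancer mode upmap
	ceph balancer on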

Please send the output of `ceph osd crush rule dump`, `ceph balancer status`, and `ceph -v`.

> 

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



