Re: EC pool OSDs getting erroneously "full" (15.2.15)

Hi Stefan,

all daemons are 15.2.15 (I'm considering updating to 15.2.16 today)

> What do you have set as nearfull ratio? ceph osd dump | grep nearfull.
nearfull is 0.87
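
(for completeness, the cluster-wide ratios and per-OSD utilisation can be
cross-checked with:

    ceph osd dump | grep ratio    # full / backfillfull / nearfull ratios
    ceph osd df tree              # %USE and VAR per OSD and host
)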
> 
> Do you have the ceph balancer enabled? ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.000538",
    "last_optimize_started": "Wed Apr 20 13:02:26 2022",
    "mode": "crush-compat",
    "optimize_result": "Some objects (0.130412) are degraded; try again later",
    "plans": []
}

> What kind of maintenance was going on?
we were replacing a failing memory module (according to the IPMI log, all errors
were corrected though..)

> 
> Are the PGs on those OSDs *way* bigger than on those of the other nodes?
> ceph pg ls-by-osd $osd-id and check for bytes (and OMAP bytes). Only
> accurate information when PGs have been recently deep-scrubbed.
sizes seem to be roughly similar (each PG is between 65-75 GB), but if I sum them
up, the total for osd.5 is almost twice that of osd.53-osd.55.
they haven't been deep-scrubbed due to the ongoing recovery though.. but the OMAP
sizes shouldn't make such a difference..
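
(in case it's useful, this is roughly how I'm summing it up - jq field names as
they appear here on 15.2, adjust if they differ on your side:

    # total data bytes / OMAP bytes of all PGs mapped to osd.5
    ceph pg ls-by-osd 5 -f json | jq '[.pg_stats[].stat_sum.num_bytes] | add'
    ceph pg ls-by-osd 5 -f json | jq '[.pg_stats[].stat_sum.num_omap_bytes] | add'
)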

> 
> In this case the PG backfilltoofull warning(s) might have been correct.
> Yesterday though, I had no OSDs close to the nearfull ratio and was getting the
> same PG backfilltoofull message ... previously seen due to this bug [1]. I
> could fix that by setting upmaps for the affected PGs to another OSD.
the warning is correct, but the usage value itself seems to be wrong..
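
(for reference, the upmap workaround Stefan describes is something along these
lines - the PG and OSD ids below are only illustrative:

    # remap the backfill_toofull PG so it targets osd.60 instead of osd.5
    ceph osd pg-upmap-items 2.1a 5 60

it needs require-min-compat-client set to luminous or newer)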

one thing that comes to mind: there seem to be a lot of PGs waiting for snaptrim..
I'll let it keep snaptrimming for some time and see if the usage goes down...
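
(the snaptrim backlog can be watched with the plain PG state filters:

    ceph pg ls snaptrim        # PGs currently trimming snapshots
    ceph pg ls snaptrim_wait   # PGs still queued for trimming

if the second list keeps shrinking, trimming is making progress)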

> 
> > 
> > any idea on why could this be happening or what to check?
> 
> It helps to know what kind of maintenance was going on. Sometimes Ceph PG
> mappings are not what you want. There are ways to do maintenance in a more
> controlled fashion.

the maintenance itself wasn't Ceph related, so it shouldn't cause any PG movement..
one thing to note: I added an SSD volume for all the OSD DBs to speed up recovery, but
we'd had this problem before that, so I don't think it's the culprit..
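
(for reference, attaching a separate DB device to an existing BlueStore OSD can be
done offline with something like this - the OSD id and LV path are only examples:

    systemctl stop ceph-osd@5
    ceph-bluestore-tool bluefs-bdev-new-db \
        --path /var/lib/ceph/osd/ceph-5 \
        --dev-target /dev/ceph-db-vg/osd-5-db
    systemctl start ceph-osd@5

recent Octopus releases should also have "ceph-volume lvm new-db" for this, which
additionally takes care of the LVM tags, IIRC)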

BR

nik

> 
> > 
> > thanks a lot in advance for hints..
> 
> Gr. Stefan
> 
> [1]: https://tracker.ceph.com/issues/39555
> 

-- 
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


