Hi Simon,
You may want to look into https://github.com/digitalocean/pgremapper to get the situation under control first.
--
Alex Gorbachev
ISS
On Fri, Aug 16, 2024 at 5:10 AM Simon Oosthoek <simon.oosthoek@xxxxxxxxx> wrote:
Hi

We had a really weird outage of Ceph today and I wonder how it came about.

The problem seems to have started around midnight. I still need to check whether it was already at the extent I found it this morning or whether it grew more gradually, but when I found it, several OSD servers had most or all OSD processes down, to the point where our EC 8+3 buckets didn't work anymore.

Restarting the servers or the services turned out to be the way to quickly recover from this.

I see some of our OSDs are coming close to (but not quite) 80-85% full. There are many times when I've seen an overfull error lead to cascading and catastrophic failures. I suspect this may have been one of them.

Which brings me to another question: why is our balancer doing so badly at balancing the OSDs? It's configured with upmap mode and it should work great with the number of PGs per OSD we have, but it is letting some OSDs reach 80% full and others not yet 50% full (we're just over 61% full in total).

The current health status is:

HEALTH_WARN Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
pg 30.3fc is active+remapped+backfill_wait+backfill_toofull, acting [66,105,124,113,89,132,206,242,179]

I've started reweighting again, because the balancer is not doing its job in our cluster for some reason...

Below is our dashboard overview; you can see the start and recovery in the 24h graph...

Cheers

/Simon
--
I'm using my gmail.com address because the gmail.com DMARC policy is "none". Some mail servers will reject this (Microsoft?); others will instead allow it when I send mail to a mailing list which has not yet been configured to send mail "on behalf of" the sender, but rather does a kind of "forward". The latter situation causes DKIM/DMARC failures, and the DMARC policy will be applied. See https://wiki.list.org/DEV/DMARC for more details.
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx