Re: weird outage of ceph

Hi Simon,

You may want to look into https://github.com/digitalocean/pgremapper to get the situation under control first.

--
Alex Gorbachev
ISS



On Fri, Aug 16, 2024 at 5:10 AM Simon Oosthoek <simon.oosthoek@xxxxxxxxx> wrote:
Hi

We had a really weird Ceph outage today, and I wonder how it came about.
The problem seems to have started around midnight. I still need to check whether it was already as bad then as I found it this morning or whether it grew more gradually, but when I found it, several OSD servers had most or all of their OSD processes down, to the point where our EC 8+3 buckets no longer worked.

Restarting the servers or the services turned out to be the way to quickly recover from this.

I see that some of our OSDs are coming close to (but not quite reaching) 80-85% full. I have often seen an overfull error lead to cascading, catastrophic failures, and I suspect this may have been one of them.

Which brings me to another question: why is our balancer doing so badly at balancing the OSDs? It's configured in upmap mode and should work well with the number of PGs per OSD we have, but it is letting some OSDs reach 80% full while others are not yet 50% full (we're just over 61% full in total).
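
To put a number on that imbalance, here is a minimal Python sketch that summarizes per-OSD utilization from the output of `ceph osd df --format json`. It assumes the JSON has a top-level "nodes" list whose entries carry "name" and "utilization" (in percent); check that against your own release before relying on it. The function names are made up for illustration.

```python
import json
import subprocess


def load_osd_df():
    """Run `ceph osd df` and return the parsed JSON.

    Needs a working ceph CLI and keyring on the host where this runs.
    """
    result = subprocess.run(
        ["ceph", "osd", "df", "--format", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def utilization_spread(osd_df, top=5):
    """Summarize how unevenly the OSDs are filled.

    Assumption: every entry in osd_df["nodes"] has "name" and
    "utilization" (percent used). Entries without it are skipped.
    """
    nodes = [n for n in osd_df.get("nodes", []) if "utilization" in n]
    if not nodes:
        raise ValueError("no OSD entries with a 'utilization' field")
    ranked = sorted(nodes, key=lambda n: n["utilization"], reverse=True)
    fullest = [(n["name"], n["utilization"]) for n in ranked[:top]]
    emptiest = [(n["name"], n["utilization"]) for n in ranked[-top:][::-1]]
    return {
        "max": ranked[0]["utilization"],
        "min": ranked[-1]["utilization"],
        "spread": ranked[0]["utilization"] - ranked[-1]["utilization"],
        "fullest": fullest,    # highest utilization first
        "emptiest": emptiest,  # lowest utilization first
    }
```

Calling `utilization_spread(load_osd_df())` on an admin node should list the fullest and emptiest OSDs, which makes it easy to see whether the outliers sit on particular hosts or device classes.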

The current health status is:
HEALTH_WARN Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
   pg 30.3fc is active+remapped+backfill_wait+backfill_toofull, acting [66,105,124,113,89,132,206,242,179]

I've started reweighting again, because the balancer is not doing its job in our cluster for some reason...

Below is our dashboard overview; you can see the start of the outage and the recovery in the 24h graph...

Cheers

/Simon

image.png


--
I'm using my gmail.com address because the gmail.com DMARC policy is "none". Some mail servers will reject this (Microsoft?); others will allow it when I send mail to a mailing list that has not yet been configured to send mail "on behalf of" the sender and instead does a kind of "forward". The latter situation causes DKIM/DMARC failures, and the DMARC policy is then applied. See https://wiki.list.org/DEV/DMARC for more details.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx