Re: weird outage of ceph

Hi Anthony

Yesterday it happened again and just now I finally had a bit of time to dig some more into this...
Just before the problems occur there are a few things happening:

- there are lots of client requests (~10^7)
- I see heartbeat failures in the logs (a lot!)

2024-08-20T11:52:33.310+0200 7fd65dfca640 -1 osd.140 992615 heartbeat_check: no reply from 131.174.23.137:6840 osd.186 since back 2024-08-20T11:52:07.896404+0200 front 2024-08-20T11:52:07.895428+0200 (oldest deadline ...
2024-08-20T11:52:33.310+0200 7fd65dfca640 -1 osd.140 992615 heartbeat_check: no reply from 131.174.23.139:6876 osd.192 since back 2024-08-20T11:52:07.894948+0200 front 2024-08-20T11:52:07.896415+0200 (oldest deadline ...
2024-08-20T11:52:33.310+0200 7fd65dfca640 -1 osd.140 992615 heartbeat_check: no reply from 131.174.23.137:6888 osd.194 since back 2024-08-20T11:52:07.895417+0200 front 2024-08-20T11:52:07.894975+0200 (oldest deadline ...

We have only one network and most osd nodes are connected to two switches using 25Gbit in a bond.

So could this be a case where lots of RGW operations (e.g. removal of objects that are no longer used) flood the OSDs to the point where they run out of time/bandwidth to keep up the heartbeats and then fail entirely?

The reason the OSD services stop is not the OOM killer, but this:

2024-08-20T11:54:34.009+0200 7fb96fb01640  1 osd.28 pg_epoch: 992704 pg[14.76c( v 992594'1421754 (991733'1418187,992594'1421754] local-lis/les=992682/992683 n=36395 ec=931480/614 lis/c=992682/989423 les/c/f=992683/98
2024-08-20T11:54:34.009+0200 7fb96f300640  1 osd.28 pg_epoch: 992704 pg[30.187s0( v 992580'836055 (983035'832531,992580'836055] local-lis/les=992269/992270 n=81123 ec=945565/935970 lis/c=992269/992269 les/c/f=992270/
2024-08-20T11:54:34.009+0200 7fb976b0f640  0 osd.28 992704 _committed_osd_maps shutdown OSD via async signal
2024-08-20T11:54:34.009+0200 7fb96eaff640  1 osd.28 pg_epoch: 992704 pg[14.435( v 992606'1366231 (991866'1363168,992606'1366231] local-lis/les=992702/992703 n=36357 ec=926450/614 lis/c=992702/992683 les/c/f=992703/99
2024-08-20T11:54:34.009+0200 7fb987b64640 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2024-08-20T11:54:34.009+0200 7fb96fb01640  1 osd.28 pg_epoch: 992704 pg[30.53cs1( v 992580'845463 (983572'842135,992580'845463] local-lis/les=992270/992271 n=81329 ec=954589/935970 lis/c=992270/992270 les/c/f=992271/
2024-08-20T11:54:34.009+0200 7fb96eaff640  1 osd.28 pg_epoch: 992704 pg[14.435( v 992606'1366231 (991866'1363168,992606'1366231] local-lis/les=992702/992703 n=36357 ec=926450/614 lis/c=992702/992683 les/c/f=992703/99
2024-08-20T11:54:34.009+0200 7fb987b64640 -1 osd.28 992704 *** Got signal Interrupt ***

I'm wondering if there is a Ceph configuration setting that could make this less likely, or whether we simply need to find the source of the flood and prevent it from happening at the rate we are seeing...
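
For what it's worth, these are the knobs I plan to look at first; the option names are what I understand from the docs, so treat this as a sketch rather than a recommendation:

# grace period before missed heartbeats get an OSD reported down (default 20s, I believe)
ceph config get osd osd_heartbeat_grace

# how aggressive RGW garbage collection is allowed to be
# ("client.rgw" is the generic section name; it may differ for your RGW daemons)
ceph config get client.rgw rgw_gc_max_concurrent_io
ceph config get client.rgw rgw_gc_processor_period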

Cheers

/Simon

On Mon, 19 Aug 2024 at 10:00, Simon Oosthoek <simon.oosthoek@xxxxxxxxx> wrote:
Hi Anthony,

Thanks for the suggested strategy! I'll have a look at how and what I can do to fix it.

To answer your question about PGs per OSD: we have 288 OSDs (across 24 OSD servers) and about ten pools (besides the .mgr pool), with varying PG counts per pool depending on how much data each holds. The PG autoscaler is on; the two main pools have 2048 PGs each, one pool has 512, and the rest have 32 PGs per pool (very little data).
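
In case it helps, this is how the per-pool PG counts and the per-OSD spread can be checked:

ceph osd pool autoscale-status   # per-pool PG counts and autoscaler targets
ceph osd df tree                 # per-OSD utilisation and PG count (PGS column)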

We mainly use EC 5+4 and 3-way replicated RBD.

Cheers

/Simon

On Sun, 18 Aug 2024 at 16:53, Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:



You may want to look into https://github.com/digitalocean/pgremapper to get the situation under control first.

--
Alex Gorbachev
ISS

Not a bad idea.

We had a really weird Ceph outage today and I wonder how it came about.
The problem seems to have started around midnight. I still need to check whether it was already at the extent I found it in this morning or whether it grew more gradually, but when I found it, several OSD servers had most or all OSD processes down, to the point where our EC 8+3 buckets no longer worked.

Look at your metrics and systems.  Were the OSDs OOMkilled?

I see some of our OSDs are coming close to (but not quite) 80-85% full.

There are many times when I've seen an overfull error lead to cascading and catastrophic failures. I suspect this may have been one of them.

One can (temporarily) raise the backfillfull / full ratios to help get out of a bad situation, but leaving them raised can lead to an even worse situation later.
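
For reference, a sketch of the commands involved; the 0.95 / 0.98 values are only an example (the defaults are 0.90 and 0.95):

ceph osd set-backfillfull-ratio 0.95   # default 0.90
ceph osd set-full-ratio 0.98           # default 0.95
# ... and once the cluster has drained, put them back:
ceph osd set-backfillfull-ratio 0.90
ceph osd set-full-ratio 0.95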


Which brings me to another question, why is our balancer doing so badly at balancing the OSDs?

There are certain situations where the bundled balancer is confounded, including CRUSH trees with multiple roots.  Subtly, that may include a cluster where some CRUSH rules specify a deviceclass and some don’t, as with the .mgr pool if deployed by Rook.  That situation confounds the PG autoscaler for sure.  If this is the case, consider modifying CRUSH rules so that all specify a deviceclass, and/or simplifying your CRUSH tree if you have explicit multiple roots.
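
A quick way to check this, assuming jq is available (the field names below match the JSON that "ceph osd crush rule dump" emits; rules that take from e.g. "default" rather than "default~hdd" have no device class):

ceph osd crush rule dump | jq '.[] | {rule: .rule_name, takes: [.steps[] | select(.op=="take") | .item_name]}'

# example of a replicated rule that does pin a device class; the rule name is a placeholder
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool set <pool> crush_rule replicated_hdd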

It's configured with upmap mode and it should work well with the number of PGs per OSD we have

Which is?

, but it is letting some OSDs reach 80% full while others are not yet at 50% (we're just over 61% full in total).

The current health status is:
HEALTH_WARN Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this doesn't resolve itself): 1 pg backfill_toofull
   pg 30.3fc is active+remapped+backfill_wait+backfill_toofull, acting [66,105,124,113,89,132,206,242,179]
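
(A quick way to see which OSD in that acting set is the blocker, assuming jq is installed; the recovery_state section of the PG query should name the too-full target:)

ceph pg 30.3fc query | jq '.recovery_state'
ceph osd df tree    # compare %USE for the OSDs in the acting set above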

I've started reweighting again, because the balancer is not doing its job in our cluster for some reason...
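
For reference, what I have checked on the balancer side so far, plus the one tuning knob I'm aware of for upmap mode (I believe the default max deviation is 5 PGs; lower values balance more aggressively):

ceph balancer status
ceph config get mgr mgr/balancer/upmap_max_deviation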

Reweighting … are you doing “ceph osd crush reweight”, or “ceph osd reweight / reweight-by-utilization”? The latter, in conjunction with pg-upmap, confuses the balancer. If that’s the situation you have, I might do the following (a rough command sketch follows the list):

* Use pgremapper or Dan’s https://gitlab.cern.ch/ceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py to freeze the PG mappings temporarily.
* Temp jack up the backfillfull/full ratios for some working room, say to 95 / 98 %
* One at a time, reset the override reweights to 1.0.  No data should move.
* Remove the manual upmaps one at a time, in order of PGs on the most-full OSDs. You should see a brief spurt of backfill.
* Rinse, lather, repeat.
* This should progressively get you to a state where you no longer have any old-style override reweights, i.e. all OSDs have 1.00000 for that value.
* Proceed removing the manual upmaps one or a few at a time
* The balancer should work now
* Set the ratios back to the default values
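
A rough sketch of the commands behind those steps; the OSD and PG ids are placeholders borrowed from earlier in the thread, and the upmap-remapped.py usage is as I understand it from its README, so review before running:

# 1. freeze the current PG mappings: the script prints the required
#    "ceph osd pg-upmap-items ..." commands, which you review and pipe to a shell
./upmap-remapped.py | sh

# 2. raise the backfillfull/full ratios temporarily, as shown earlier

# 3. reset the old-style override reweights, one OSD at a time; no data should move
ceph osd reweight 66 1.0

# 4. list the upmap exceptions and remove them gradually, fullest OSDs first
ceph osd dump | grep pg_upmap_items
ceph osd rm-pg-upmap-items 30.3fc

# 5. when the balancer has taken over, drop the ratios back to the defaults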



Below is our dashboard overview; you can see the start and recovery in the 24h graph...

Cheers

/Simon

[Attachment image.png: dashboard overview showing the outage start and recovery in the 24h graph]


--
I'm using my gmail.com address because the gmail.com DMARC policy is "none". Some mail servers (Microsoft?) will reject this; others will allow it when I send mail to a mailing list that has not yet been configured to send mail "on behalf of" the sender but instead does a kind of "forward". The latter situation causes DKIM/DMARC failures and the DMARC policy will be applied. See https://wiki.list.org/DEV/DMARC for more details.



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
