Re: Adding OSD's results in slow ops, inactive PG's

On 18/01/2024 10:46, Frank Schilder wrote:
Hi, maybe this is related. On a system with many disks I also had aio problems causing OSDs to hang. Here it was the kernel parameter fs.aio-max-nr that was way too low by default. I bumped it to fs.aio-max-nr = 1048576 (sysctl/tuned) and OSDs came up right away.
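For reference, making that persistent is typically just a sysctl drop-in along these lines (the file name is only an example):

# echo "fs.aio-max-nr = 1048576" > /etc/sysctl.d/90-aio-max-nr.conf
# sysctl --system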

Hi

I was actually just digging into this. It seems we already have fs.aio-max-nr at that value, perhaps a RHEL 9 default?

[root@dreamy ~]# cat /proc/sys/fs/aio-max-nr
1048576

Good thing it is, given our current usage:

[root@dreamy ~]# cat /proc/sys/fs/aio-nr
573440

Bumping bdev_aio_max_queue_depth to 8192 seems to have made the aio_submit retries go away, but the OSDs still take quite a while to come in.
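For anyone following along, that was set with something like the following, followed by an OSD restart (bdev_* options are, as far as I know, only read when the OSD opens the device):

# ceph config set osd bdev_aio_max_queue_depth 8192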

Best regards,

Torkil


Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Thursday, January 18, 2024 9:46 AM
To: ceph-users@xxxxxxx
Subject:  Re: Adding OSD's results in slow ops, inactive PG's

I'm glad to hear (or read) that it worked for you as well. :-)

Quoting Torkil Svensgaard <torkil@xxxxxxxx>:

On 18/01/2024 09:30, Eugen Block wrote:
Hi,

[ceph: root@lazy /]# ceph-conf --show-config | egrep osd_max_pg_per_osd_hard_ratio
osd_max_pg_per_osd_hard_ratio = 3.000000

I don't think this is the right tool; it says:

--show-config-value <key>       Print the corresponding ceph.conf value
                                that matches the specified key.
                                Also searches global defaults.

I suggest querying the daemon directly:

storage01:~ # ceph config set osd osd_max_pg_per_osd_hard_ratio 5

storage01:~ # ceph tell osd.0 config get osd_max_pg_per_osd_hard_ratio
{
     "osd_max_pg_per_osd_hard_ratio": "5.000000"
}
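To check it on all OSDs at once, something like this should also work (I believe ceph tell accepts a wildcard target):

storage01:~ # ceph tell osd.* config get osd_max_pg_per_osd_hard_ratio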

Copy that, verified to be 5 now.

Daemons are running but those last OSDs won't come online.
I've tried upping bdev_aio_max_queue_depth but it didn't seem to
make a difference.

I don't have any good idea for that right now except what you
already tried. Which values for bdev_aio_max_queue_depth have you
tried?

The previous value was 1024, I bumped it to 4096.

A couple of the OSDs seemingly stuck on the aio thing have now come
up, so I went ahead and added the rest. Some of them came in right
away, some are still stuck on the aio thing. Hopefully they will
recover eventually.

Thank you again for the osd_max_pg_per_osd_hard_ratio suggestion,
that seems to have solved the core issue =)

Best regards,

Torkil


Quoting Torkil Svensgaard <torkil@xxxxxxxx>:

On 18/01/2024 07:48, Eugen Block wrote:
Hi,

  -3281> 2024-01-17T14:57:54.611+0000 7f2c6f7ef540  0 osd.431 2154828 load_pgs opened 750 pgs <---

I'd say that's close enough to what I suspected. ;-) Not sure why
the "maybe_wait_for_max_pg" message isn't there but I'd give it a
try with a higher osd_max_pg_per_osd_hard_ratio.

Might have helped, not quite sure.

I've set both since I wasn't sure which one was the right one:

"
ceph config dump | grep osd_max_pg_per_osd_hard_ratio
global   advanced  osd_max_pg_per_osd_hard_ratio   5.000000
osd      advanced  osd_max_pg_per_osd_hard_ratio   5.000000
"

Restarted MONs and MGRs. Still getting this with ceph-conf though:

"
[ceph: root@lazy /]# ceph-conf --show-config | egrep osd_max_pg_per_osd_hard_ratio
osd_max_pg_per_osd_hard_ratio = 3.000000
"

I re-added a couple small SSD OSDs and they came in just fine. I
then added a couple HDD OSDs and they also came in after a bit of
aio_submit spam. I added a couple more and have now been looking
at this for 40 minutes:


"
...

2024-01-18T07:42:01.789+0000 7f734fa04700 -1 bdev(0x56295d586400 /var/lib/ceph/osd/ceph-436/block) aio_submit retries 10
2024-01-18T07:42:01.808+0000 7f734fa04700 -1 bdev(0x56295d586400 /var/lib/ceph/osd/ceph-436/block) aio_submit retries 4
2024-01-18T07:42:01.819+0000 7f735d1b8700 -1 bdev(0x56295d586400 /var/lib/ceph/osd/ceph-436/block) aio_submit retries 82
2024-01-18T07:42:07.499+0000 7f734fa04700 -1 bdev(0x56295d586400 /var/lib/ceph/osd/ceph-436/block) aio_submit retries 6
2024-01-18T07:42:07.542+0000 7f734fa04700 -1 bdev(0x56295d586400 /var/lib/ceph/osd/ceph-436/block) aio_submit retries 8
2024-01-18T07:42:07.554+0000 7f735d1b8700 -1 bdev(0x56295d586400 /var/lib/ceph/osd/ceph-436/block) aio_submit retries 108
...
"

Daemons are running but those last OSDs won't come online.

I've tried upping bdev_aio_max_queue_depth but it didn't seem to
make a difference.

Best regards,

Torkil


Quoting Torkil Svensgaard <torkil@xxxxxxxx>:

On 17-01-2024 22:20, Eugen Block wrote:
Hi,

Hi

this sounds a bit like a customer issue we had almost two years
ago. Basically, it was about mon_max_pg_per_osd (default 250),
which was exceeded on the first activating OSD (and the last
remaining stopping OSD). You can read all the details in the
lengthy thread [1]. But if this was the actual issue you should
probably see something like this in the logs:

2022-04-06 14:24:55.256 7f8bb5a0e700 1 osd.8 43377 maybe_wait_for_max_pg withhold creation of pg 75.56s16: 750 >= 750

In our case we did the opposite and removed an entire host.
I'll just quote Josh's explanation from the mentioned thread:

1. All OSDs on the host are purged per above.
2. New OSDs are created.
3. As they come up, one by one, CRUSH starts to assign PGs to them.
Importantly, when the first OSD comes up, it gets a large number of
PGs, exceeding mon_max_pg_per_osd. Thus, some of these PGs don't
activate.
4. As each of the remaining OSDs come up, CRUSH re-assigns some
PGs to them.
5. Finally, all OSDs are up. However, any PGs that were stuck in
"activating" from step 3 that were _not_ reassigned to other OSDs are
still stuck in "activating", and need a repeer or OSD down/up cycle to
restart peering for them. (At least in Pacific, tweaking
mon_max_pg_per_osd also allows some of these PGs to make peering
progress.)
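(For the repeer mentioned in step 5, something along these lines is usually enough; the PG id is a placeholder:

ceph pg ls activating
ceph pg repeer <pgid>

or an OSD down/up cycle with "ceph osd down <id>" on one of the affected OSDs.)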

Note that during backfill/recovery the limit is 750
(mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio ==> 250 * 3
= 750). As a workaround we increased
osd_max_pg_per_osd_hard_ratio to 5 and the issue was never seen
again.
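In config terms that amounts to something like the following, which with the defaults above raises the backfill/recovery limit from 250 * 3 = 750 to 250 * 5 = 1250 PGs per OSD:

ceph config set osd osd_max_pg_per_osd_hard_ratio 5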
Can you check the logs for that message?

I can't find that error in the logs but I notice this:

"
[root@dreamy 8ee2d228-ed21-4580-8bbf-0649f229e21d]# less ceph-osd.431.log | grep load_pgs
2024-01-17T13:07:41.202+0000 7f74eae5f540  0 osd.431 0 load_pgs
2024-01-17T13:07:41.202+0000 7f74eae5f540  0 osd.431 0 load_pgs opened 0 pgs
2024-01-17T13:48:31.781+0000 7f381e34b540  0 osd.431 2154180 load_pgs
2024-01-17T13:48:37.881+0000 7f381e34b540  0 osd.431 2154180 load_pgs opened 559 pgs
2024-01-17T13:52:54.473+0000 7f5e713a7540  0 osd.431 2154420 load_pgs
2024-01-17T13:53:00.156+0000 7f5e713a7540  0 osd.431 2154420 load_pgs opened 559 pgs
2024-01-17T13:58:42.052+0000 7f23c3482540  0 osd.431 2154580 load_pgs
2024-01-17T13:58:47.710+0000 7f23c3482540  0 osd.431 2154580 load_pgs opened 559 pgs
  -5202> 2024-01-17T13:58:42.052+0000 7f23c3482540  0 osd.431 2154580 load_pgs
  -5184> 2024-01-17T13:58:47.710+0000 7f23c3482540  0 osd.431 2154580 load_pgs opened 559 pgs
2024-01-17T14:57:46.714+0000 7f2c6f7ef540  0 osd.431 2154828 load_pgs
2024-01-17T14:57:54.611+0000 7f2c6f7ef540  0 osd.431 2154828 load_pgs opened 750 pgs
  -3363> 2024-01-17T14:57:46.714+0000 7f2c6f7ef540  0 osd.431 2154828 load_pgs
  -3281> 2024-01-17T14:57:54.611+0000 7f2c6f7ef540  0 osd.431 2154828 load_pgs opened 750 pgs <---
"

We'll try increasing osd_max_pg_per_osd_hard_ratio to 5 tomorrow
when onsite just to check if it makes a difference.

This is what we have in the log for osd.11, one of the stalled
osds, when osd.431 was started:

"
2024-01-17T14:57:59.064+0000 7f644154e700  0
log_channel(cluster) log [INF] : 11.857s0 continuing backfill to
osd.104(4) from (2151654'3111903,2154845'3112716] MIN to
2154845'3112716
2024-01-17T14:57:59.064+0000 7f644154e700  0
log_channel(cluster) log [DBG] : 11.857s0 starting backfill to
osd.128(3) from (2151654'3111903,2154278'3112714] MIN to
2154845'3112716
2024-01-17T14:57:59.067+0000 7f643fd4b700  0
log_channel(cluster) log [DBG] : 11.5e4s0 starting backfill to
osd.431(5) from (2151000'3067292,2154631'3068041] MIN to
2154848'3068052
2024-01-17T14:57:59.067+0000 7f644054c700  0
log_channel(cluster) log [INF] : 11.cc8s0 continuing backfill to
osd.184(5) from (2151675'2997168,2154838'2997942] MIN to
2154838'2997942
2024-01-17T14:57:59.067+0000 7f644054c700  0
log_channel(cluster) log [DBG] : 11.cc8s0 starting backfill to
osd.431(1) from (2151675'2997168,2154820'2997923] MIN to
2154838'2997942
2024-01-17T14:57:59.071+0000 7f643f54a700  0
log_channel(cluster) log [DBG] : 11.883s0 starting backfill to
osd.431(4) from (2151058'2837151,2154693'2837901] MIN to
2154845'2837919
2024-01-17T14:57:59.079+0000 7f6440d4d700  0
log_channel(cluster) log [DBG] : 11.637s0 starting backfill to
osd.431(5) from (2151654'3204863,2154801'3205673] MIN to
2154845'3205687
2024-01-17T14:57:59.106+0000 7f643fd4b700  0
log_channel(cluster) log [INF] : 11.4f9s0 continuing backfill to
osd.149(3) from (2151654'3169077,2154845'3169891] MIN to
2154845'3169891
2024-01-17T14:57:59.106+0000 7f643fd4b700  0
log_channel(cluster) log [DBG] : 11.4f9s0 starting backfill to
osd.431(4) from (2151654'3169077,2154183'3169887] MIN to
2154845'3169891
2024-01-17T14:57:59.122+0000 7f644054c700  0
log_channel(cluster) log [DBG] : 11.c73s0 starting backfill to
osd.431(3) from (2152025'3130484,2154822'3131249] MIN to
2154845'3131256
2024-01-17T14:57:59.135+0000 7f644154e700  0
log_channel(cluster) log [INF] : 11.857s0 continuing backfill to
osd.272(0) from (2151654'3111903,2154845'3112716] MIN to
2154845'3112716
2024-01-17T14:57:59.136+0000 7f644154e700  0
log_channel(cluster) log [INF] : 11.857s0 continuing backfill to
osd.333(2) from (2151654'3111903,2154845'3112716] MIN to
2154845'3112716
2024-01-17T14:57:59.136+0000 7f644154e700  0
log_channel(cluster) log [INF] : 11.857s0 continuing backfill to
osd.423(5) from (2151654'3111903,2154845'3112716] MIN to
2154845'3112716
2024-01-17T14:57:59.136+0000 7f644154e700  0
log_channel(cluster) log [DBG] : 11.857s0 starting backfill to
osd.431(1) from (2151654'3111903,2154278'3112714] MIN to
2154845'3112716
2024-01-17T14:57:59.161+0000 7f644154e700  0
log_channel(cluster) log [INF] : 11.f8cs0 continuing backfill to
osd.105(4) from (2152025'3065404,2154848'3066167] MIN to
2154848'3066167
2024-01-17T14:57:59.162+0000 7f644154e700  0
log_channel(cluster) log [INF] : 11.f8cs0 continuing backfill to
osd.421(2) from (2152025'3065404,2154848'3066167] MIN to
2154848'3066167
2024-01-17T14:57:59.162+0000 7f644154e700  0
log_channel(cluster) log [DBG] : 11.f8cs0 starting backfill to
osd.431(5) from (0'0,0'0] MAX to 2154848'3066167
2024-01-17T14:57:59.167+0000 7f643f54a700  0
log_channel(cluster) log [DBG] : 37.6ees0 starting backfill to
osd.431(8) from (0'0,0'0] MAX to 2154128'24444
2024-01-17T14:58:33.532+0000 7f6455d77700 -1 osd.11 2154852
get_health_metrics reporting 1 slow ops, oldest is
osd_op(client.1823267746.0:788334371 11.f8cs0 11.2dd6f8c
(undecoded) ondisk+
write+known_if_redirected e2154852)
...
"

Not sure if that provides any clues.

Thanks!

Best regards,

Torkil

Regards,
Eugen

[1] https://www.spinics.net/lists/ceph-users/msg71933.html

Quoting Ruben Vestergaard <rubenv@xxxxxxxx>:

Hi

We have a cluster which currently looks like so:

  services:
    mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 13d)
    mgr: jolly.tpgixt(active, since 25h), standbys: dopey.lxajvk, lazy.xuhetq
    mds: 1/1 daemons up, 2 standby
    osd: 449 osds: 425 up (since 15m), 425 in (since 5m); 5104 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 11153 pgs
    objects: 304.11M objects, 988 TiB
    usage:   1.6 PiB used, 1.4 PiB / 2.9 PiB avail
    pgs:     6/1617270006 objects degraded (0.000%)
             366696947/1617270006 objects misplaced (22.674%)
             6043 active+clean
             5041 active+remapped+backfill_wait
             66   active+remapped+backfilling
             2    active+recovery_wait+degraded+remapped
             1    active+recovering+degraded


It's currently rebalancing after adding a node, but this
rebalance has been rather slow -- right now it's running 66
backfills, but it seems to stabilize around 8 backfills
eventually. We figured that perhaps adding another node might
speed things up.

Immediately upon adding the node, we get slow ops and inactive
PG's. Removing the new node gets us back in working order.

It turns out that even adding 1 OSD breaks the cluster, and
immediately sends it here:

    [WRN] PG_DEGRADED: Degraded data redundancy: 6/1617265712 objects degraded (0.000%), 3 pgs degraded
        pg 37.c8 is active+recovery_wait+degraded+remapped, acting [410,163,236,209,7,283,155,143,78]
        pg 37.1a1 is active+recovering+degraded, acting [234,424,163,74,22,128,177,153,181]
        pg 37.1da is active+recovery_wait+degraded+remapped, acting [163,408,230,190,93,284,50,78,44]
    [WRN] SLOW_OPS: 22 slow ops, oldest one blocked for 54 sec, daemons [osd.11,osd.110,osd.112,osd.117,osd.120,osd.123,osd.13,osd.136,osd.144,osd.157]... have slow ops.
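For the record, the blocked ops on one of the listed daemons can be inspected via its admin socket (run on the host carrying that OSD), e.g.:

    ceph daemon osd.11 dump_ops_in_flight
    ceph daemon osd.11 dump_historic_ops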

The OSD added had number 431, so it does not appear to be the
immediate cause of the slow ops; however, removing 431
immediately clears the problem.

We thought we might be experiencing 'Crush giving up too soon'
symptoms [1], as we have seen similar behaviour on another
pool, but it does not appear to be the case here. We went
through the motions described on the page and everything
looked OK.
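For reference, the check on that page boils down to something like this (the rule id is a placeholder; --num-rep 6 matches a 4+2 EC pool):

    ceph osd getcrushmap -o crush.map
    crushtool -i crush.map --test --show-bad-mappings --rule <rule-id> --num-rep 6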

At least one pool which stops working is a 4+2 EC pool, placed
on spinning rust, some 200-ish disks distributed across 13
nodes. I'm not sure if other pools break, but that particular
4+2 EC pool is rather important so I'm a little wary of
experimenting blindly.

Any thoughts on where to look next?

Thanks,
Ruben Vestergaard

[1] https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon

--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark

--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil@xxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



