Hi, maybe this is related. On a system with many disks I also had aio problems causing OSDs to hang. Here it was the kernel parameter fs.aio-max-nr that was way too low by default. I bumped it to fs.aio-max-nr = 1048576 (sysctl/tuned) and OSDs came up right away. Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Eugen Block <eblock@xxxxxx> Sent: Thursday, January 18, 2024 9:46 AM To: ceph-users@xxxxxxx Subject: Re: Adding OSD's results in slow ops, inactive PG's I'm glad to hear (or read) that it worked for you as well. :-) Zitat von Torkil Svensgaard <torkil@xxxxxxxx>: > On 18/01/2024 09:30, Eugen Block wrote: >> Hi, >> >>> [ceph: root@lazy /]# ceph-conf --show-config | egrep >>> osd_max_pg_per_osd_hard_ratio >>> osd_max_pg_per_osd_hard_ratio = 3.000000 >> >> I don't think this is the right tool; it says: >> >>> --show-config-value <key> Print the corresponding ceph.conf value >>> that matches the specified key. >>> Also searches >>> global defaults. >> >> I suggest querying the daemon directly: >> >> storage01:~ # ceph config set osd osd_max_pg_per_osd_hard_ratio 5 >> >> storage01:~ # ceph tell osd.0 config get osd_max_pg_per_osd_hard_ratio >> { >> "osd_max_pg_per_osd_hard_ratio": "5.000000" >> } > > Copy that, verified to be 5 now. > >>> Daemons are running but those last OSDs won't come online. >>> I've tried upping bdev_aio_max_queue_depth but it didn't seem to >>> make a difference. >> >> I don't have any good idea for that right now except what you >> already tried. Which values for bdev_aio_max_queue_depth have you >> tried? > > The previous value was 1024, I bumped it to 4096. > > A couple of the OSDs seemingly stuck on the aio thing have now come > to, so I went ahead and added the rest. Some of them came in right > away, some are stuck on the aio thing. Hopefully they will recover > eventually. > > Thank you again for the osd_max_pg_per_osd_hard_ratio suggestion, > that seems to have solved the core issue =) > > Mvh. > > Torkil > >> >> Zitat von Torkil Svensgaard <torkil@xxxxxxxx>: >> >>> On 18/01/2024 07:48, Eugen Block wrote: >>>> Hi, >>>> >>>>> -3281> 2024-01-17T14:57:54.611+0000 7f2c6f7ef540 0 osd.431 >>>>> 2154828 load_pgs opened 750 pgs <--- >>>> >>>> I'd say that's close enough to what I suspected. ;-) Not sure why >>>> the "maybe_wait_for_max_pg" message isn't there, but I'd give it a >>>> try with a higher osd_max_pg_per_osd_hard_ratio. >>> >>> Might have helped, not quite sure. >>> >>> I've set both, since I wasn't sure which one was the right one: >>> >>> " >>> ceph config dump | grep osd_max_pg_per_osd_hard_ratio >>> global advanced osd_max_pg_per_osd_hard_ratio >>> 5.000000 osd advanced >>> osd_max_pg_per_osd_hard_ratio 5.000000 >>> " >>> >>> Restarted MONs and MGRs. Still getting this with ceph-conf though: >>> >>> " >>> [ceph: root@lazy /]# ceph-conf --show-config | egrep >>> osd_max_pg_per_osd_hard_ratio >>> osd_max_pg_per_osd_hard_ratio = 3.000000 >>> " >>> >>> I re-added a couple small SSD OSDs and they came in just fine. I >>> then added a couple HDD OSDs and they also came in after a bit of >>> aio_submit spam. I added a couple more and have now been looking >>> at this for 40 minutes: >>> >>> >>> " >>> ...
>>> >>> 2024-01-18T07:42:01.789+0000 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 10 >>> 2024-01-18T07:42:01.808+0000 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 4 >>> 2024-01-18T07:42:01.819+0000 7f735d1b8700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 82 >>> 2024-01-18T07:42:07.499+0000 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 6 >>> 2024-01-18T07:42:07.542+0000 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 8 >>> 2024-01-18T07:42:07.554+0000 7f735d1b8700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 108 >>> ... >>> " >>> >>> Daemons are running but those last OSDs won't come online. >>> >>> I've tried upping bdev_aio_max_queue_depth but it didn't seem to >>> make a difference. >>> >>> Mvh. >>> >>> Torkil >>> >>>> >>>> Zitat von Torkil Svensgaard <torkil@xxxxxxxx>: >>>> >>>>> On 17-01-2024 22:20, Eugen Block wrote: >>>>>> Hi, >>>>> >>>>> Hi >>>>> >>>>>> this sounds a bit like a customer issue we had almost two years >>>>>> ago. Basically, it was about mon_max_pg_per_osd (default 250) >>>>>> which was exceeded on the first activating OSD (and on the >>>>>> last remaining OSD when stopping). You can read all the details in >>>>>> the lengthy thread [1]. But if this was the actual issue you >>>>>> probably should see something like this in the logs: >>>>>> >>>>>> 2022-04-06 14:24:55.256 7f8bb5a0e700 1 osd.8 43377 >>>>>> maybe_wait_for_max_pg withhold creation of pg 75.56s16: 750 >= >>>>>> 750 >>>>>> >>>>>> In our case we did the opposite and removed an entire host. >>>>>> I'll just quote Josh's explanation from the mentioned thread: >>>>>> >>>>>> 1. All OSDs on the host are purged per above. >>>>>> 2. New OSDs are created. >>>>>> 3. As they come up, one by one, CRUSH starts to assign PGs to them. >>>>>> Importantly, when the first OSD comes up, it gets a large number of >>>>>> PGs, exceeding mon_max_pg_per_osd. Thus, some of these PGs don't >>>>>> activate. >>>>>> 4. As each of the remaining OSDs comes up, CRUSH re-assigns some >>>>>> PGs to them. >>>>>> 5. Finally, all OSDs are up. However, any PGs that were stuck in >>>>>> "activating" from step 3 that were _not_ reassigned to other OSDs are >>>>>> still stuck in "activating", and need a repeer or OSD down/up cycle to >>>>>> restart peering for them. (At least in Pacific, tweaking >>>>>> mon_max_pg_per_osd also allows some of these PGs to make peering >>>>>> progress.) >>>>>> >>>>>> Note that during backfill/recovery the limit is 750 >>>>>> (mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio ==> 250 * 3 >>>>>> = 750). As a workaround we increased >>>>>> osd_max_pg_per_osd_hard_ratio to 5 and the issue was never seen >>>>>> again. >>>>>> Can you check the logs for that message?
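For reference, a condensed sketch of the workaround described above, using only commands that exist in current Ceph releases; osd.431 and pg 11.f8c are just ids taken from the logs further down this thread, substitute your own, and the repeer command is what Josh's "repeer" step refers to:

"
# raise the hard ratio cluster-wide and verify it on a running OSD daemon
ceph config set osd osd_max_pg_per_osd_hard_ratio 5
ceph tell osd.431 config get osd_max_pg_per_osd_hard_ratio

# list PGs still stuck in "activating" and restart peering for one of them
ceph pg dump pgs_brief | grep activating
ceph pg repeer 11.f8c
"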
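And, tying in Frank's note at the top of the thread, a minimal sketch for checking whether aio_submit retries like the ones above come from hitting the kernel's fs.aio-max-nr limit; the sysctl.d file name is arbitrary and 1048576 is simply the value from his mail:

"
# show the current limit and the number of async I/O requests currently allocated
sysctl fs.aio-max-nr fs.aio-nr

# raise the limit persistently and reload (drop-in file name is only an example)
echo 'fs.aio-max-nr = 1048576' > /etc/sysctl.d/90-ceph-aio.conf
sysctl --system
"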
>>>>> >>>>> I can't find that error in the logs but I notice this: >>>>> >>>>> " >>>>> [root@dreamy 8ee2d228-ed21-4580-8bbf-0649f229e21d]# less >>>>> ceph-osd.431.log | grep load_pgs >>>>> 2024-01-17T13:07:41.202+0000 7f74eae5f540 0 osd.431 0 load_pgs >>>>> 2024-01-17T13:07:41.202+0000 7f74eae5f540 0 osd.431 0 load_pgs >>>>> opened 0 pgs >>>>> 2024-01-17T13:48:31.781+0000 7f381e34b540 0 osd.431 2154180 load_pgs >>>>> 2024-01-17T13:48:37.881+0000 7f381e34b540 0 osd.431 2154180 >>>>> load_pgs opened 559 pgs >>>>> 2024-01-17T13:52:54.473+0000 7f5e713a7540 0 osd.431 2154420 load_pgs >>>>> 2024-01-17T13:53:00.156+0000 7f5e713a7540 0 osd.431 2154420 >>>>> load_pgs opened 559 pgs >>>>> 2024-01-17T13:58:42.052+0000 7f23c3482540 0 osd.431 2154580 load_pgs >>>>> 2024-01-17T13:58:47.710+0000 7f23c3482540 0 osd.431 2154580 >>>>> load_pgs opened 559 pgs >>>>> -5202> 2024-01-17T13:58:42.052+0000 7f23c3482540 0 osd.431 >>>>> 2154580 load_pgs >>>>> -5184> 2024-01-17T13:58:47.710+0000 7f23c3482540 0 osd.431 >>>>> 2154580 load_pgs opened 559 pgs >>>>> 2024-01-17T14:57:46.714+0000 7f2c6f7ef540 0 osd.431 2154828 load_pgs >>>>> 2024-01-17T14:57:54.611+0000 7f2c6f7ef540 0 osd.431 2154828 >>>>> load_pgs opened 750 pgs >>>>> -3363> 2024-01-17T14:57:46.714+0000 7f2c6f7ef540 0 osd.431 >>>>> 2154828 load_pgs >>>>> -3281> 2024-01-17T14:57:54.611+0000 7f2c6f7ef540 0 osd.431 >>>>> 2154828 load_pgs opened 750 pgs <--- >>>>> " >>>>> >>>>> We'll try increasing osd_max_pg_per_osd_hard_ratio to 5 tomorrow >>>>> when onsite just to check if it makes a difference. >>>>> >>>>> This is what we have in the log for osd.11, one of the stalled >>>>> osds, when osd.431 was started: >>>>> >>>>> " >>>>> 2024-01-17T14:57:59.064+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.857s0 continuing backfill to >>>>> osd.104(4) from (2151654'3111903,2154845'3112716] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.064+0000 7f644154e700 0 >>>>> log_channel(cluster) log [DBG] : 11.857s0 starting backfill to >>>>> osd.128(3) from (2151654'3111903,2154278'3112714] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.067+0000 7f643fd4b700 0 >>>>> log_channel(cluster) log [DBG] : 11.5e4s0 starting backfill to >>>>> osd.431(5) from (2151000'3067292,2154631'3068041] MIN to >>>>> 2154848'3068052 >>>>> 2024-01-17T14:57:59.067+0000 7f644054c700 0 >>>>> log_channel(cluster) log [INF] : 11.cc8s0 continuing backfill to >>>>> osd.184(5) from (2151675'2997168,2154838'2997942] MIN to >>>>> 2154838'2997942 >>>>> 2024-01-17T14:57:59.067+0000 7f644054c700 0 >>>>> log_channel(cluster) log [DBG] : 11.cc8s0 starting backfill to >>>>> osd.431(1) from (2151675'2997168,2154820'2997923] MIN to >>>>> 2154838'2997942 >>>>> 2024-01-17T14:57:59.071+0000 7f643f54a700 0 >>>>> log_channel(cluster) log [DBG] : 11.883s0 starting backfill to >>>>> osd.431(4) from (2151058'2837151,2154693'2837901] MIN to >>>>> 2154845'2837919 >>>>> 2024-01-17T14:57:59.079+0000 7f6440d4d700 0 >>>>> log_channel(cluster) log [DBG] : 11.637s0 starting backfill to >>>>> osd.431(5) from (2151654'3204863,2154801'3205673] MIN to >>>>> 2154845'3205687 >>>>> 2024-01-17T14:57:59.106+0000 7f643fd4b700 0 >>>>> log_channel(cluster) log [INF] : 11.4f9s0 continuing backfill to >>>>> osd.149(3) from (2151654'3169077,2154845'3169891] MIN to >>>>> 2154845'3169891 >>>>> 2024-01-17T14:57:59.106+0000 7f643fd4b700 0 >>>>> log_channel(cluster) log [DBG] : 11.4f9s0 starting backfill to >>>>> osd.431(4) from (2151654'3169077,2154183'3169887] MIN to >>>>> 2154845'3169891 >>>>> 
2024-01-17T14:57:59.122+0000 7f644054c700 0 >>>>> log_channel(cluster) log [DBG] : 11.c73s0 starting backfill to >>>>> osd.431(3) from (2152025'3130484,2154822'3131249] MIN to >>>>> 2154845'3131256 >>>>> 2024-01-17T14:57:59.135+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.857s0 continuing backfill to >>>>> osd.272(0) from (2151654'3111903,2154845'3112716] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.136+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.857s0 continuing backfill to >>>>> osd.333(2) from (2151654'3111903,2154845'3112716] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.136+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.857s0 continuing backfill to >>>>> osd.423(5) from (2151654'3111903,2154845'3112716] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.136+0000 7f644154e700 0 >>>>> log_channel(cluster) log [DBG] : 11.857s0 starting backfill to >>>>> osd.431(1) from (2151654'3111903,2154278'3112714] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.161+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.f8cs0 continuing backfill to >>>>> osd.105(4) from (2152025'3065404,2154848'3066167] MIN to >>>>> 2154848'3066167 >>>>> 2024-01-17T14:57:59.162+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.f8cs0 continuing backfill to >>>>> osd.421(2) from (2152025'3065404,2154848'3066167] MIN to >>>>> 2154848'3066167 >>>>> 2024-01-17T14:57:59.162+0000 7f644154e700 0 >>>>> log_channel(cluster) log [DBG] : 11.f8cs0 starting backfill to >>>>> osd.431(5) from (0'0,0'0] MAX to 2154848'3066167 >>>>> 2024-01-17T14:57:59.167+0000 7f643f54a700 0 >>>>> log_channel(cluster) log [DBG] : 37.6ees0 starting backfill to >>>>> osd.431(8) from (0'0,0'0] MAX to 2154128'24444 >>>>> 2024-01-17T14:58:33.532+0000 7f6455d77700 -1 osd.11 2154852 >>>>> get_health_metrics reporting 1 slow ops, oldest is >>>>> osd_op(client.1823267746.0:788334371 11.f8cs0 11.2dd6f8c >>>>> (undecoded) ondisk+ >>>>> write+known_if_redirected e2154852) >>>>> ... >>>>> " >>>>> >>>>> Not sure if that provides any clues. >>>>> >>>>> Thanks! >>>>> >>>>> Mvh. >>>>> >>>>> Torkil >>>>> >>>>>> Regards, >>>>>> Eugen >>>>>> >>>>>> [1] https://www.spinics.net/lists/ceph-users/msg71933.html >>>>>> >>>>>> Zitat von Ruben Vestergaard <rubenv@xxxxxxxx>: >>>>>> >>>>>>> Hi >>>>>>> >>>>>>> We have a cluster with which currently looks like so: >>>>>>> >>>>>>> services: >>>>>>> mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 13d) >>>>>>> mgr: jolly.tpgixt(active, since 25h), standbys: >>>>>>> dopey.lxajvk, lazy.xuhetq >>>>>>> mds: 1/1 daemons up, 2 standby >>>>>>> osd: 449 osds: 425 up (since 15m), 425 in (since 5m); >>>>>>> 5104 remapped pgs >>>>>>> data: >>>>>>> volumes: 1/1 healthy >>>>>>> pools: 13 pools, 11153 pgs >>>>>>> objects: 304.11M objects, 988 TiB >>>>>>> usage: 1.6 PiB used, 1.4 PiB / 2.9 PiB avail >>>>>>> pgs: 6/1617270006 objects degraded (0.000%) >>>>>>> 366696947/1617270006 objects misplaced (22.674%) >>>>>>> 6043 active+clean >>>>>>> 5041 active+remapped+backfill_wait >>>>>>> 66 active+remapped+backfilling >>>>>>> 2 active+recovery_wait+degraded+remapped >>>>>>> 1 active+recovering+degraded >>>>>>> >>>>>>> >>>>>>> It's currently rebalancing after adding a node, but this >>>>>>> rebalance has been rather slow -- right now it's running 66 >>>>>>> backfills, but it seems to stabilize around 8 backfills >>>>>>> eventually. We figured that perhaps adding another node might >>>>>>> speed things up. 
>>>>>>> >>>>>>> Immediately upon adding the node, we get slow ops and inactive >>>>>>> PG's. Removing the new node gets us back in working order. >>>>>>> >>>>>>> It turns out that even adding 1 OSD breaks the cluster, and >>>>>>> immediately sends it here: >>>>>>> >>>>>>> [WRN] PG_DEGRADED: Degraded data redundancy: 6/1617265712 >>>>>>> objects degraded (0.000%), 3 pgs degraded >>>>>>> pg 37.c8 is active+recovery_wait+degraded+remapped, >>>>>>> acting [410,163,236,209,7,283,155,143,78] >>>>>>> pg 37.1a1 is active+recovering+degraded, acting >>>>>>> [234,424,163,74,22,128,177,153,181] >>>>>>> pg 37.1da is active+recovery_wait+degraded+remapped, >>>>>>> acting [163,408,230,190,93,284,50,78,44] >>>>>>> [WRN] SLOW_OPS: 22 slow ops, oldest one blocked for 54 >>>>>>> sec, daemons >>>>>>> [osd.11,osd.110,osd.112,osd.117,osd.120,osd.123,osd.13,osd. >>>>>>> 136,osd.144,osd.157]... have slow ops. >>>>>>> >>>>>>> The OSD added had number 431, so it does not appear to be the >>>>>>> immediate cause of the slow ops, however, removing 431 >>>>>>> immediately clears the problem. >>>>>>> >>>>>>> We thought we might be experiencing 'Crush giving up too soon' >>>>>>> symptoms [1], as we have seen similar behaviour on another >>>>>>> pool, but it does not appear to be the case here. We went >>>>>>> through the motions described on the page and everything >>>>>>> looked OK. >>>>>>> >>>>>>> At least one pool which stops working is a 4+2 EC pool, placed >>>>>>> on spinning rust, some 200-ish disks distributed across 13 >>>>>>> nodes. I'm not sure if other pools break, but that particular >>>>>>> 4+2 EC pool is rather important so I'm a little wary of >>>>>>> experimenting blindly. >>>>>>> >>>>>>> Any thoughts on where to look next? >>>>>>> >>>>>>> Thanks, >>>>>>> Ruben Vestergaard >>>>>>> >>>>>>> [1] https://docs.ceph.com/en/reef/rados/troubleshooting/ >>>>>>> troubleshooting-pg/#crush-gives-up-too-soon >>>>>>> _______________________________________________ >>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>>>> >>>>> -- >>>>> Torkil Svensgaard >>>>> Systems Administrator >>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >>>>> Copenhagen University Hospital Amager and Hvidovre >>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark >>>>> _______________________________________________ >>>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>>> >>>> >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>> >>> -- >>> Torkil Svensgaard >>> Sysadmin >>> MR-Forskningssektionen, afs. 714 >>> DRCMR, Danish Research Centre for Magnetic Resonance >>> Hvidovre Hospital >>> Kettegård Allé 30 >>> DK-2650 Hvidovre >>> Denmark >>> Tel: +45 386 22828 >>> E-mail: torkil@xxxxxxxx >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users@xxxxxxx >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >> >> >> _______________________________________________ >> ceph-users mailing list -- ceph-users@xxxxxxx >> To unsubscribe send an email to ceph-users-leave@xxxxxxx > > -- > Torkil Svensgaard > Sysadmin > MR-Forskningssektionen, afs. 
714 > DRCMR, Danish Research Centre for Magnetic Resonance > Hvidovre Hospital > Kettegård Allé 30 > DK-2650 Hvidovre > Denmark > Tel: +45 386 22828 > E-mail: torkil@xxxxxxxx > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx > To unsubscribe send an email to ceph-users-leave@xxxxxxx _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx