Hi, maybe this is related. On a system with many disks I also had aio problems causing OSDs to hang. Here it was the kernel parameter fs.aio-max-nr that was way too low by default. I bumped it to fs.aio-max-nr = 1048576 (sysctl/tuned) and OSDs came up right away. Best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Eugen Block <eblock@xxxxxx> Sent: Thursday, January 18, 2024 9:46 AM To: ceph-users@xxxxxxx Subject: Re: Adding OSD's results in slow ops, inactive PG's I'm glad to hear (or read) that it worked for you as well. :-) Zitat von Torkil Svensgaard <torkil@xxxxxxxx>: > On 18/01/2024 09:30, Eugen Block wrote: >> Hi, >> >>> [ceph: root@lazy /]# ceph-conf --show-config | egrep >>> osd_max_pg_per_osd_hard_ratio >>> osd_max_pg_per_osd_hard_ratio = 3.000000 >> >> I don't think this is the right tool; it says: >> >>> --show-config-value <key> Print the corresponding ceph.conf value >>> that matches the specified key. >>> Also searches >>> global defaults. >> >> I suggest querying the daemon directly: >> >> storage01:~ # ceph config set osd osd_max_pg_per_osd_hard_ratio 5 >> >> storage01:~ # ceph tell osd.0 config get osd_max_pg_per_osd_hard_ratio >> { >> "osd_max_pg_per_osd_hard_ratio": "5.000000" >> } > > Copy that, verified to be 5 now. > >>> Daemons are running but those last OSDs won't come online. >>> I've tried upping bdev_aio_max_queue_depth but it didn't seem to >>> make a difference. >> >> I don't have any good idea for that right now except what you >> already tried. Which values for bdev_aio_max_queue_depth have you >> tried? > > The previous value was 1024, I bumped it to 4096. > > A couple of the OSDs seemingly stuck on the aio thing have now come > to, so I went ahead and added the rest. Some of them came in right > away, some are stuck on the aio thing. Hopefully they will recover > eventually. > > Thank you again for the osd_max_pg_per_osd_hard_ratio suggestion, > that seems to have solved the core issue =) > > Mvh. > > Torkil > >> >> Zitat von Torkil Svensgaard <torkil@xxxxxxxx>: >> >>> On 18/01/2024 07:48, Eugen Block wrote: >>>> Hi, >>>> >>>>> -3281> 2024-01-17T14:57:54.611+0000 7f2c6f7ef540 0 osd.431 >>>>> 2154828 load_pgs opened 750 pgs <--- >>>> >>>> I'd say that's close enough to what I suspected. ;-) Not sure why >>>> the "maybe_wait_for_max_pg" message isn't there, but I'd give it a >>>> try with a higher osd_max_pg_per_osd_hard_ratio. >>> >>> Might have helped, not quite sure. >>> >>> I've set both, since I wasn't sure which one was the right one: >>> >>> " >>> ceph config dump | grep osd_max_pg_per_osd_hard_ratio >>> global advanced osd_max_pg_per_osd_hard_ratio >>> 5.000000 osd advanced >>> osd_max_pg_per_osd_hard_ratio 5.000000 >>> " >>> >>> Restarted MONs and MGRs. Still getting this with ceph-conf though: >>> >>> " >>> [ceph: root@lazy /]# ceph-conf --show-config | egrep >>> osd_max_pg_per_osd_hard_ratio >>> osd_max_pg_per_osd_hard_ratio = 3.000000 >>> " >>> >>> I re-added a couple small SSD OSDs and they came in just fine. I >>> then added a couple HDD OSDs and they also came in after a bit of >>> aio_submit spam. I added a couple more and have now been looking >>> at this for 40 minutes: >>> >>> >>> " >>> ...
>>> >>> 2024-01-18T07:42:01.789+0000 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 10 >>> 2024-01-18T07:42:01.808+0000 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 4 >>> 2024-01-18T07:42:01.819+0000 7f735d1b8700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 82 >>> 2024-01-18T07:42:07.499+0000 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 6 >>> 2024-01-18T07:42:07.542+0000 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 8 >>> 2024-01-18T07:42:07.554+0000 7f735d1b8700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 108 >>> ... >>> " >>> >>> Daemons are running but those last OSDs won't come online. >>> >>> I've tried upping bdev_aio_max_queue_depth but it didn't seem to >>> make a difference. >>> >>> Mvh. >>> >>> Torkil >>> >>>> >>>> Zitat von Torkil Svensgaard <torkil@xxxxxxxx>: >>>> >>>>> On 17-01-2024 22:20, Eugen Block wrote: >>>>>> Hi, >>>>> >>>>> Hi >>>>> >>>>>> this sounds a bit like a customer issue we had almost two years >>>>>> ago. Basically, it was about mon_max_pg_per_osd (default 250) >>>>>> which was exceeded on the first activating OSD (and on the >>>>>> last remaining OSD when stopping). You can read all the details in >>>>>> the lengthy thread [1]. But if this was the actual issue you >>>>>> probably should see something like this in the logs: >>>>>> >>>>>> 2022-04-06 14:24:55.256 7f8bb5a0e700 1 osd.8 43377 >>>>>> maybe_wait_for_max_pg withhold creation of pg 75.56s16: 750 >= >>>>>> 750 >>>>>> >>>>>> In our case we did the opposite and removed an entire host. >>>>>> I'll just quote Josh's explanation from the mentioned thread: >>>>>> >>>>>> 1. All OSDs on the host are purged per above. >>>>>> 2. New OSDs are created. >>>>>> 3. As they come up, one by one, CRUSH starts to assign PGs to them. >>>>>> Importantly, when the first OSD comes up, it gets a large number of >>>>>> PGs, exceeding mon_max_pg_per_osd. Thus, some of these PGs don't >>>>>> activate. >>>>>> 4. As each of the remaining OSDs comes up, CRUSH re-assigns some >>>>>> PGs to them. >>>>>> 5. Finally, all OSDs are up. However, any PGs that were stuck in >>>>>> "activating" from step 3 that were _not_ reassigned to other OSDs are >>>>>> still stuck in "activating", and need a repeer or OSD down/up cycle to >>>>>> restart peering for them. (At least in Pacific, tweaking >>>>>> mon_max_pg_per_osd also allows some of these PGs to make peering >>>>>> progress.) >>>>>> >>>>>> Note that during backfill/recovery the limit is 750 >>>>>> (mon_max_pg_per_osd * osd_max_pg_per_osd_hard_ratio ==> 250 * 3 >>>>>> = 750). As a workaround we increased >>>>>> osd_max_pg_per_osd_hard_ratio to 5 and the issue was never seen >>>>>> again. >>>>>> Can you check the logs for that message?
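For reference, a condensed sketch of the workaround described above, using only commands that exist in current Ceph releases; osd.431 and pg 11.f8c are just ids taken from the logs further down this thread, substitute your own, and the repeer command is what Josh's "repeer" step refers to:

"
# raise the hard ratio cluster-wide and verify it on a running OSD daemon
ceph config set osd osd_max_pg_per_osd_hard_ratio 5
ceph tell osd.431 config get osd_max_pg_per_osd_hard_ratio

# list PGs still stuck in "activating" and restart peering for one of them
ceph pg dump pgs_brief | grep activating
ceph pg repeer 11.f8c
"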
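And, tying in Frank's note at the top of the thread, a minimal sketch for checking whether aio_submit retries like the ones above come from hitting the kernel's fs.aio-max-nr limit; the sysctl.d file name is arbitrary and 1048576 is simply the value from his mail:

"
# show the current limit and the number of async I/O requests currently allocated
sysctl fs.aio-max-nr fs.aio-nr

# raise the limit persistently and reload (drop-in file name is only an example)
echo 'fs.aio-max-nr = 1048576' > /etc/sysctl.d/90-ceph-aio.conf
sysctl --system
"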
>>>>> >>>>> I can't find that error in the logs but I notice this: >>>>> >>>>> " >>>>> [root@dreamy 8ee2d228-ed21-4580-8bbf-0649f229e21d]# less >>>>> ceph-osd.431.log | grep load_pgs >>>>> 2024-01-17T13:07:41.202+0000 7f74eae5f540 0 osd.431 0 load_pgs >>>>> 2024-01-17T13:07:41.202+0000 7f74eae5f540 0 osd.431 0 load_pgs >>>>> opened 0 pgs >>>>> 2024-01-17T13:48:31.781+0000 7f381e34b540 0 osd.431 2154180 load_pgs >>>>> 2024-01-17T13:48:37.881+0000 7f381e34b540 0 osd.431 2154180 >>>>> load_pgs opened 559 pgs >>>>> 2024-01-17T13:52:54.473+0000 7f5e713a7540 0 osd.431 2154420 load_pgs >>>>> 2024-01-17T13:53:00.156+0000 7f5e713a7540 0 osd.431 2154420 >>>>> load_pgs opened 559 pgs >>>>> 2024-01-17T13:58:42.052+0000 7f23c3482540 0 osd.431 2154580 load_pgs >>>>> 2024-01-17T13:58:47.710+0000 7f23c3482540 0 osd.431 2154580 >>>>> load_pgs opened 559 pgs >>>>> -5202> 2024-01-17T13:58:42.052+0000 7f23c3482540 0 osd.431 >>>>> 2154580 load_pgs >>>>> -5184> 2024-01-17T13:58:47.710+0000 7f23c3482540 0 osd.431 >>>>> 2154580 load_pgs opened 559 pgs >>>>> 2024-01-17T14:57:46.714+0000 7f2c6f7ef540 0 osd.431 2154828 load_pgs >>>>> 2024-01-17T14:57:54.611+0000 7f2c6f7ef540 0 osd.431 2154828 >>>>> load_pgs opened 750 pgs >>>>> -3363> 2024-01-17T14:57:46.714+0000 7f2c6f7ef540 0 osd.431 >>>>> 2154828 load_pgs >>>>> -3281> 2024-01-17T14:57:54.611+0000 7f2c6f7ef540 0 osd.431 >>>>> 2154828 load_pgs opened 750 pgs <--- >>>>> " >>>>> >>>>> We'll try increasing osd_max_pg_per_osd_hard_ratio to 5 tomorrow >>>>> when onsite just to check if it makes a difference. >>>>> >>>>> This is what we have in the log for osd.11, one of the stalled >>>>> osds, when osd.431 was started: >>>>> >>>>> " >>>>> 2024-01-17T14:57:59.064+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.857s0 continuing backfill to >>>>> osd.104(4) from (2151654'3111903,2154845'3112716] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.064+0000 7f644154e700 0 >>>>> log_channel(cluster) log [DBG] : 11.857s0 starting backfill to >>>>> osd.128(3) from (2151654'3111903,2154278'3112714] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.067+0000 7f643fd4b700 0 >>>>> log_channel(cluster) log [DBG] : 11.5e4s0 starting backfill to >>>>> osd.431(5) from (2151000'3067292,2154631'3068041] MIN to >>>>> 2154848'3068052 >>>>> 2024-01-17T14:57:59.067+0000 7f644054c700 0 >>>>> log_channel(cluster) log [INF] : 11.cc8s0 continuing backfill to >>>>> osd.184(5) from (2151675'2997168,2154838'2997942] MIN to >>>>> 2154838'2997942 >>>>> 2024-01-17T14:57:59.067+0000 7f644054c700 0 >>>>> log_channel(cluster) log [DBG] : 11.cc8s0 starting backfill to >>>>> osd.431(1) from (2151675'2997168,2154820'2997923] MIN to >>>>> 2154838'2997942 >>>>> 2024-01-17T14:57:59.071+0000 7f643f54a700 0 >>>>> log_channel(cluster) log [DBG] : 11.883s0 starting backfill to >>>>> osd.431(4) from (2151058'2837151,2154693'2837901] MIN to >>>>> 2154845'2837919 >>>>> 2024-01-17T14:57:59.079+0000 7f6440d4d700 0 >>>>> log_channel(cluster) log [DBG] : 11.637s0 starting backfill to >>>>> osd.431(5) from (2151654'3204863,2154801'3205673] MIN to >>>>> 2154845'3205687 >>>>> 2024-01-17T14:57:59.106+0000 7f643fd4b700 0 >>>>> log_channel(cluster) log [INF] : 11.4f9s0 continuing backfill to >>>>> osd.149(3) from (2151654'3169077,2154845'3169891] MIN to >>>>> 2154845'3169891 >>>>> 2024-01-17T14:57:59.106+0000 7f643fd4b700 0 >>>>> log_channel(cluster) log [DBG] : 11.4f9s0 starting backfill to >>>>> osd.431(4) from (2151654'3169077,2154183'3169887] MIN to >>>>> 2154845'3169891 >>>>> 
2024-01-17T14:57:59.122+0000 7f644054c700 0 >>>>> log_channel(cluster) log [DBG] : 11.c73s0 starting backfill to >>>>> osd.431(3) from (2152025'3130484,2154822'3131249] MIN to >>>>> 2154845'3131256 >>>>> 2024-01-17T14:57:59.135+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.857s0 continuing backfill to >>>>> osd.272(0) from (2151654'3111903,2154845'3112716] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.136+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.857s0 continuing backfill to >>>>> osd.333(2) from (2151654'3111903,2154845'3112716] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.136+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.857s0 continuing backfill to >>>>> osd.423(5) from (2151654'3111903,2154845'3112716] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.136+0000 7f644154e700 0 >>>>> log_channel(cluster) log [DBG] : 11.857s0 starting backfill to >>>>> osd.431(1) from (2151654'3111903,2154278'3112714] MIN to >>>>> 2154845'3112716 >>>>> 2024-01-17T14:57:59.161+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.f8cs0 continuing backfill to >>>>> osd.105(4) from (2152025'3065404,2154848'3066167] MIN to >>>>> 2154848'3066167 >>>>> 2024-01-17T14:57:59.162+0000 7f644154e700 0 >>>>> log_channel(cluster) log [INF] : 11.f8cs0 continuing backfill to >>>>> osd.421(2) from (2152025'3065404,2154848'3066167] MIN to >>>>> 2154848'3066167 >>>>> 2024-01-17T14:57:59.162+0000 7f644154e700 0 >>>>> log_channel(cluster) log [DBG] : 11.f8cs0 starting backfill to >>>>> osd.431(5) from (0'0,0'0] MAX to 2154848'3066167 >>>>> 2024-01-17T14:57:59.167+0000 7f643f54a700 0 >>>>> log_channel(cluster) log [DBG] : 37.6ees0 starting backfill to >>>>> osd.431(8) from (0'0,0'0] MAX to 2154128'24444 >>>>> 2024-01-17T14:58:33.532+0000 7f6455d77700 -1 osd.11 2154852 >>>>> get_health_metrics reporting 1 slow ops, oldest is >>>>> osd_op(client.1823267746.0:788334371 11.f8cs0 11.2dd6f8c >>>>> (undecoded) ondisk+ >>>>> write+known_if_redirected e2154852) >>>>> ... >>>>> " >>>>> >>>>> Not sure if that provides any clues. >>>>> >>>>> Thanks! >>>>> >>>>> Mvh. >>>>> >>>>> Torkil >>>>> >>>>>> Regards, >>>>>> Eugen >>>>>> >>>>>> [1] https://www.spinics.net/lists/ceph-users/msg71933.html >>>>>> >>>>>> Zitat von Ruben Vestergaard <rubenv@xxxxxxxx>: >>>>>> >>>>>>> Hi >>>>>>> >>>>>>> We have a cluster with which currently looks like so: >>>>>>> >>>>>>> services: >>>>>>> mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 13d) >>>>>>> mgr: jolly.tpgixt(active, since 25h), standbys: >>>>>>> dopey.lxajvk, lazy.xuhetq >>>>>>> mds: 1/1 daemons up, 2 standby >>>>>>> osd: 449 osds: 425 up (since 15m), 425 in (since 5m); >>>>>>> 5104 remapped pgs >>>>>>> data: >>>>>>> volumes: 1/1 healthy >>>>>>> pools: 13 pools, 11153 pgs >>>>>>> objects: 304.11M objects, 988 TiB >>>>>>> usage: 1.6 PiB used, 1.4 PiB / 2.9 PiB avail >>>>>>> pgs: 6/1617270006 objects degraded (0.000%) >>>>>>> 366696947/1617270006 objects misplaced (22.674%) >>>>>>> 6043 active+clean >>>>>>> 5041 active+remapped+backfill_wait >>>>>>> 66 active+remapped+backfilling >>>>>>> 2 active+recovery_wait+degraded+remapped >>>>>>> 1 active+recovering+degraded >>>>>>> >>>>>>> >>>>>>> It's currently rebalancing after adding a node, but this >>>>>>> rebalance has been rather slow -- right now it's running 66 >>>>>>> backfills, but it seems to stabilize around 8 backfills >>>>>>> eventually. We figured that perhaps adding another node might >>>>>>> speed things up. 
>>>>>>> >>>>>>> Immediately upon adding the node, we get slow ops and inactive >>>>>>> PG's. Removing the new node gets us back in working order. >>>>>>> >>>>>>> It turns out that even adding 1 OSD breaks the cluster, and >>>>>>> immediately sends it here: >>>>>>> >>>>>>> [WRN] PG_DEGRADED: Degraded data redundancy: 6/1617265712 >>>>>>> objects degraded (0.000%), 3 pgs degraded >>>>>>> pg 37.c8 is active+recovery_wait+degraded+remapped, >>>>>>> acting [410,163,236,209,7,283,155,143,78] >>>>>>> pg 37.1a1 is active+recovering+degraded, acting >>>>>>> [234,424,163,74,22,128,177,153,181] >>>>>>> pg 37.1da is active+recovery_wait+degraded+remapped, >>>>>>> acting [163,408,230,190,93,284,50,78,44] >>>>>>> [WRN] SLOW_OPS: 22 slow ops, oldest one blocked for 54 >>>>>>> sec, daemons >>>>>>> [osd.11,osd.110,osd.112,osd.117,osd.120,osd.123,osd.13,osd. >>>>>>> 136,osd.144,osd.157]... have slow ops. >>>>>>> >>>>>>> The OSD added had number 431, so it does not appear to be the >>>>>>> immediate cause of the slow ops, however, removing 431 >>>>>>> immediately clears the problem. >>>>>>> >>>>>>> We thought we might be experiencing 'Crush giving up too soon' >>>>>>> symptoms [1], as we have seen similar behaviour on another >>>>>>> pool, but it does not appear to be the case here. We went >>>>>>> through the motions described on the page and everything >>>>>>> looked OK. >>>>>>> >>>>>>> At least one pool which stops working is a 4+2 EC pool, placed >>>>>>> on spinning rust, some 200-ish disks distributed across 13 >>>>>>> nodes. I'm not sure if other pools break, but that particular >>>>>>> 4+2 EC pool is rather important so I'm a little wary of >>>>>>> experimenting blindly. >>>>>>> >>>>>>> Any thoughts on where to look next? >>>>>>> >>>>>>> Thanks, >>>>>>> Ruben Vestergaard >>>>>>> >>>>>>> [1] https://docs.ceph.com/en/reef/rados/troubleshooting/ >>>>>>> troubleshooting-pg/#crush-gives-up-too-soon >>>>>>> _______________________________________________ >>>>>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>>>> >>>>> -- >>>>> Torkil Svensgaard >>>>> Systems Administrator >>>>> Danish Research Centre for Magnetic Resonance DRCMR, Section 714 >>>>> Copenhagen University Hospital Amager and Hvidovre >>>>> Kettegaard Allé 30, 2650 Hvidovre, Denmark >>>>> _______________________________________________ >>>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>>> >>>> >>>> _______________________________________________ >>>> ceph-users mailing list -- ceph-users@xxxxxxx >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >>> >>> -- >>> Torkil Svensgaard >>> Sysadmin >>> MR-Forskningssektionen, afs. 714 >>> DRCMR, Danish Research Centre for Magnetic Resonance >>> Hvidovre Hospital >>> Kettegård Allé 30 >>> DK-2650 Hvidovre >>> Denmark >>> Tel: +45 386 22828 >>> E-mail: torkil@xxxxxxxx >>> _______________________________________________ >>> ceph-users mailing list -- ceph-users@xxxxxxx >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx >> >> >> _______________________________________________ >> ceph-users mailing list -- ceph-users@xxxxxxx >> To unsubscribe send an email to ceph-users-leave@xxxxxxx > > -- > Torkil Svensgaard > Sysadmin > MR-Forskningssektionen, afs. 
714 > DRCMR, Danish Research Centre for Magnetic Resonance > Hvidovre Hospital > Kettegård Allé 30 > DK-2650 Hvidovre > Denmark > Tel: +45 386 22828 > E-mail: torkil@xxxxxxxx > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx > To unsubscribe send an email to ceph-users-leave@xxxxxxx _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx