Hi Igor,

my approach here, before doing something crazy like a daily cron job for restarting OSDs, is to do at least a minimum of threat analysis. How much of a problem is it really? I'm also mostly guided by performance loss here. As far as I know, the onode cache should be one of the most important caches regarding performance. Of course, only if the hit-rate is decent, and this I can't pull out. Since I can't check the hit-rate, the second best thing is to see how an OSD's item count compares with the average, how it develops on a restarted OSD and so on, to get an idea of what is normal, what is degraded and what requires action.

As far as I can tell after this relatively short amount of time, the item leak is a rather mild problem on our cluster. The few OSDs that were exceptional are all OSDs that were newly deployed and not restarted since backfill completed. It seems that backfill is an operation that triggers a measurable amount of cache_other to be lost to cleanup. Otherwise, a restart every 2-3 months might be warranted. Since we plan to upgrade to Pacific this summer, this means not too much needs to be done. I will just keep an eye on onode item counts and restart one or the other OSD when warranted.

About "it's just a restart": most of the time it is. However, there was just recently a case where a restart meant complete loss of an OSD. The bug causing the restart corrupted the RocksDB beyond repair. Therefore, I think it's always worth checking, doing some threat analysis and preventing unintended restarts if possible.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 12 January 2023 13:07:11
To: Frank Schilder; Dongdong Tao; ceph-users@xxxxxxx
Cc: dev@xxxxxxx
Subject: Re: [ceph-users] Re: OSD crash on Onode::put

Hi Frank,

IMO all the below logic is a bit of overkill and no one can provide 100% valid guidance on specific numbers atm.
Generally I agree with Dongdong's point that a crash is effectively an OSD restart and hence it makes little sense to perform such a restart manually. Well, the rationale might be to do that gracefully and avoid some potential issues though...

Anyway, I'd rather recommend doing a periodic(!) manual OSD restart, e.g. on a daily basis at off-peak hours, instead of using tricks with mempool stats analysis.

Thanks,
Igor

On 1/10/2023 1:15 PM, Frank Schilder wrote:

Hi Dongdong and Igor,

thanks for pointing to this issue. I guess if it's a memory leak issue (well, cache pool trim issue), checking for some indicator and an OSD restart should be a work-around? Dongdong promised a work-around but talks only about a patch (fix).

Looking at the tracker items, my conclusion is that unusually low values of .mempool.by_pool.bluestore_cache_onode.items of an OSD might be such an indicator. I just ran a very simple check on all our OSDs:

    for o in $(ceph osd ls); do n_onode="$(ceph tell "osd.$o" dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items")"; ((n_onode<100000)) && echo "$o: $n_onode"; done

and found 2 with seemingly very unusual values:

    1111: 3098
    1112: 7403

Comparing two OSDs with the same disk on the same host gives:

    # ceph daemon osd.1111 dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
    3200
    1971200
    260924
    900303680

    # ceph daemon osd.1030 dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
    60281
    37133096
    8908591
    255862680

OSD 1111 does look somewhat bad.
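To put a number on "somewhat bad", the average size of a bluestore_cache_other entry (bytes divided by items) can be worked out from the two dumps above. This is only a small sketch: the helper name bytes_per_item is made up for illustration, and the inputs are copied from the outputs quoted above.

```shell
#!/usr/bin/env bash
# bytes_per_item BYTES ITEMS
# Hypothetical helper (not part of Ceph): prints BYTES/ITEMS with two
# decimals, or "nan" when ITEMS is zero or missing.
bytes_per_item() {
    awk -v b="$1" -v i="$2" 'BEGIN {
        if (i + 0 == 0) print "nan"
        else printf "%.2f\n", b / i
    }'
}

# bluestore_cache_other bytes and items, taken from the dumps above.
bytes_per_item 900303680 260924    # osd.1111: prints 3450.44
bytes_per_item 255862680 8908591   # osd.1030: prints 28.72
```

On these numbers the cache_other entries of osd.1111 are on average more than a hundred times larger than those of osd.1030, which matches the impression that osd.1111 is the unhealthy one.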
Shortly after restarting this OSD I get

    # ceph daemon osd.1111 dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
    20775
    12797400
    803582
    24017100

So, the above procedure seems to work and, yes, there seems to be a leak of items in cache_other that pushes other pools down to 0. There seem to be 2 useful indicators:

- very low .mempool.by_pool.bluestore_cache_onode.items
- very high .mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items

Here is a command to get both numbers with the OSD ID in an awk-friendly format:

    for o in $(ceph osd ls); do printf "%6d %8d %7.2f\n" "$o" $(ceph tell "osd.$o" dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items"); done

Pipe it to a file and do things like:

    awk '$2<50000 || $3>200' FILE

For example, I still get:

    # awk '$2<50000 || $3>200' cache_onode.txt
      1092    49225   43.74
      1093    46193   43.70
      1098    47550   43.47
      1101    48873   43.34
      1102    48008   43.31
      1103    48152   43.29
      1105    49235   43.59
      1107    46694   43.35
      1109    48511   43.08
      1113    14612  739.46
      1114    13199  693.76
      1116    45300  205.70

flagging 3 more outliers.

Would it be possible to provide a bit of guidance to everyone about when to consider restarting an OSD? What values of the above variables are critical and what are tolerable? Of course a proper fix would be better, but I doubt that everyone is willing to apply a patch. Therefore, some guidance on how to mitigate this problem to acceptable levels might be useful. I'm thinking here of how few onode items are acceptable before performance drops painfully.
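The awk filter above can be extended a little so that each flagged OSD also shows which of the two indicators fired. The following is only a sketch: the function name flag_osds and the reason labels are invented for illustration, and the default thresholds (50000 items, 200 bytes per item) are simply the ad hoc values from the awk line above, not validated limits.

```shell
#!/usr/bin/env bash
# flag_osds [MIN_ONODE_ITEMS] [MAX_BYTES_PER_ITEM]
# Hypothetical helper: reads "OSD_ID ONODE_ITEMS BYTES_PER_ITEM" lines on
# stdin (the three-column format produced by the printf loop above) and
# prints only the flagged OSDs, with the reason(s) appended.
flag_osds() {
    local min_items="${1:-50000}" max_ratio="${2:-200}"
    awk -v min="$min_items" -v max="$max_ratio" '
        NF >= 3 {
            reason = ""
            # Indicator 1: very low bluestore_cache_onode.items
            if ($2 + 0 < min + 0) reason = "low_onode_items"
            # Indicator 2: very high cache_other bytes per item
            if ($3 + 0 > max + 0)
                reason = (reason == "" ? "" : reason ",") "high_bytes_per_item"
            if (reason != "") print $1, $2, $3, reason
        }'
}
```

With the file from above, `flag_osds < cache_onode.txt` would list the same OSDs as the plain awk filter, and `flag_osds 20000 500 < cache_onode.txt` would apply stricter limits of one's own choosing.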
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 09 January 2023 13:34:42
To: Dongdong Tao; ceph-users@xxxxxxx
Cc: dev@xxxxxxx
Subject: [ceph-users] Re: OSD crash on Onode::put

Hi Dongdong,

thanks a lot for your post, it's really helpful.

Thanks,
Igor

On 1/5/2023 6:12 AM, Dongdong Tao wrote:

I see many users recently reporting that they have been struggling with this Onode::put race condition issue [1] on both the latest Octopus and Pacific. Igor opened a PR [2] to address this issue. I've reviewed it carefully, and it looks good to me. I'm hoping this could get some priority from the community.

For those who have been hitting this issue, I would like to share a workaround that could unblock you:

During the investigation of this issue, I found that this race condition always happens after the bluestore onode cache size becomes 0. Setting debug_bluestore = 1/30 will allow you to see the cache size after the crash:

---
2022-10-25T00:47:26.562+0000 7f424f78e700 30 bluestore.MempoolThread(0x564a9dae2a68) _resize_shards max_shard_onodes: 0 max_shard_buffer: 8388608
---

This is apparently wrong, as it means the bluestore metadata cache is basically disabled, but it explains well why we are hitting the race condition so easily: an onode will be trimmed right away after it's unpinned.

Continuing the investigation, it turned out that the culprit for the 0-sized cache is the leak that happened in the bluestore_cache_other mempool. Please refer to the bug tracker [3], which has the details of the leak issue. It was already fixed by [4], and the next Pacific point release will have it. But it was never backported to Octopus.
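To check whether an OSD has reached the 0-sized onode cache state described above, its log can be scanned for _resize_shards lines that report max_shard_onodes: 0. This is a minimal sketch that assumes the log line layout shown in the example above; the function name zero_onode_cache_events is made up for illustration.

```shell
#!/usr/bin/env bash
# zero_onode_cache_events
# Hypothetical helper: reads OSD log lines on stdin and prints the timestamp
# (first field) of every _resize_shards line with "max_shard_onodes: 0".
# Such lines only appear when debug_bluestore is raised as described above.
zero_onode_cache_events() {
    awk '/_resize_shards/ {
        for (i = 1; i < NF; i++)
            if ($i == "max_shard_onodes:" && $(i + 1) == "0") {
                print $1
                break
            }
    }'
}
```

A possible call would be `zero_onode_cache_events < /var/log/ceph/ceph-osd.1111.log`, where the log path is an assumption and will differ between deployments, for example with containerized OSDs.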
So if you are hitting the same:

For those who are on Octopus, you can manually backport this patch to fix the leak and prevent the race condition from happening.
For those who are on Pacific, you can wait for the next Pacific point release.

By the way, I'm backporting the fix to Ubuntu Octopus and Pacific through this SRU [5], so it will land in Ubuntu's packages soon.

[1] https://tracker.ceph.com/issues/56382
[2] https://github.com/ceph/ceph/pull/47702
[3] https://tracker.ceph.com/issues/56424
[4] https://github.com/ceph/ceph/pull/46911
[5] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1996010

Cheers,
Dongdong

--
Igor Fedotov
Ceph Lead Developer
--
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web <https://croit.io/> | LinkedIn <http://linkedin.com/company/croit> | Youtube <https://www.youtube.com/channel/UCIJJSKVdcSLGLBtwSFx_epw> | Twitter <https://twitter.com/croit_io>

Meet us at the SC22 Conference! Learn more <https://croit.io/croit-sc22>
Technology Fast50 Award Winner by Deloitte <https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>!