Hi Frank,
IMO all the logic below is a bit of overkill, and no one can provide 100%
valid guidance on specific numbers at the moment. Generally I agree with
Dongdong's point that a crash is effectively an OSD restart, so there is
not much sense in performing such a restart manually - well, the rationale
might be to do it gracefully and avoid some potential issues, though...
Anyway, I'd rather recommend doing a periodic(!) manual OSD restart, e.g. on
a daily basis at off-peak hours, instead of using tricks with mempool
stats analysis.
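For what it's worth, such a staggered daily restart could be driven by a plain cron entry per host. This is only a sketch: the 03:00 off-peak window, the 5-minute pause, and the ceph-osd@<id>.service unit naming (as used by package-based deployments) are assumptions you'd adjust to your environment.

```shell
# Hypothetical /etc/crontab-style entry (one per OSD host): restart this
# host's OSDs one at a time at 03:00, pausing 5 minutes between restarts
# so PGs can re-peer before the next OSD goes down.
0 3 * * *  root  for u in $(systemctl list-units --plain --no-legend 'ceph-osd@*.service' | awk '{print $1}'); do systemctl restart "$u"; sleep 300; done
```

On cephadm-managed clusters, "ceph orch daemon restart osd.<id>" would be the analogous per-daemon restart command.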
Thanks,
Igor
On 1/10/2023 1:15 PM, Frank Schilder wrote:
Hi Dongdong and Igor,
thanks for pointing to this issue. I guess if it's a memory leak issue (well, a cache pool trim issue), checking for some indicator and restarting the OSD should be a workaround? Dongdong promised a workaround but talks only about a patch (fix).
Looking at the tracker items, my conclusion is that an unusually low value of .mempool.by_pool.bluestore_cache_onode.items for an OSD might be such an indicator. I just ran a very simple check on all our OSDs:
for o in $(ceph osd ls); do
  n_onode="$(ceph tell "osd.$o" dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items")"
  ((n_onode<100000)) && echo "$o: $n_onode"
done
and found 2 with seemingly very unusual values:
1111: 3098
1112: 7403
Comparing two OSDs with the same disk model on the same host gives:
# ceph daemon osd.1111 dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
3200
1971200
260924
900303680
# ceph daemon osd.1030 dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
60281
37133096
8908591
255862680
OSD 1111 does look somewhat bad. Shortly after restarting this OSD I get
# ceph daemon osd.1111 dump_mempools | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_onode.bytes,.mempool.by_pool.bluestore_cache_other.items,.mempool.by_pool.bluestore_cache_other.bytes"
20775
12797400
803582
24017100
So, the above procedure seems to work and, yes, there seems to be a leak of items in cache_other that pushes the other pools down to 0. There seem to be two useful indicators:
- very low .mempool.by_pool.bluestore_cache_onode.items
- very high .mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items
Here is a command to get both numbers, together with the OSD ID, in an awk-friendly format:
for o in $(ceph osd ls); do
  printf "%6d %8d %7.2f\n" "$o" $(ceph tell "osd.$o" dump_mempools \
    | jq ".mempool.by_pool.bluestore_cache_onode.items,.mempool.by_pool.bluestore_cache_other.bytes/.mempool.by_pool.bluestore_cache_other.items")
done
Pipe it to a file and do things like:
awk '$2<50000 || $3>200' FILE
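To illustrate what the filter picks up, here it is applied to two sample rows built from the numbers of the healthy osd.1030 and the bad osd.1111 above (the file name is just an example):

```shell
# Two sample rows in the format produced above: OSD ID, onode items,
# cache_other bytes per item (values taken from osd.1030 and osd.1111).
cat > mempool_sample.txt <<'EOF'
  1030    60281   28.72
  1111     3200 3450.44
EOF

# Flag OSDs with suspiciously few onodes or bloated cache_other entries;
# only the osd.1111 row matches.
awk '$2<50000 || $3>200' mempool_sample.txt
```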
For example, I still get:
# awk '$2<50000 || $3>200' cache_onode.txt
1092 49225 43.74
1093 46193 43.70
1098 47550 43.47
1101 48873 43.34
1102 48008 43.31
1103 48152 43.29
1105 49235 43.59
1107 46694 43.35
1109 48511 43.08
1113 14612 739.46
1114 13199 693.76
1116 45300 205.70
flagging 3 more outliers (OSDs 1113, 1114 and 1116, via their high per-item cache_other values).
Would it be possible to provide a bit of guidance to everyone about when to consider restarting an OSD? What values of the above variables are critical, and what are tolerable? Of course a proper fix would be better, but I doubt that everyone is willing to apply a patch. Therefore, some guidance on how to mitigate this problem to acceptable levels might be useful. I'm thinking here of how few onode items are acceptable before performance drops painfully.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 09 January 2023 13:34:42
To: Dongdong Tao; ceph-users@xxxxxxx
Cc: dev@xxxxxxx
Subject: Re: OSD crash on Onode::put
Hi Dongdong,
thanks a lot for your post, it's really helpful.
Thanks,
Igor
On 1/5/2023 6:12 AM, Dongdong Tao wrote:
I see many users recently reporting that they have been struggling
with this Onode::put race condition issue [1] on both the latest
Octopus and Pacific.
Igor opened a PR [2] to address this issue; I've reviewed it
carefully, and it looks good to me. I'm hoping this could get some
priority from the community.
For those who have been hitting this issue, I would like to share a
workaround that could unblock you:
During the investigation of this issue, I found that this race condition
always happens after the bluestore onode cache size becomes 0.
Setting debug_bluestore = 1/30 will let you see the cache size
after the crash:
---
2022-10-25T00:47:26.562+0000 7f424f78e700 30
bluestore.MempoolThread(0x564a9dae2a68) _resize_shards
max_shard_onodes: 0 max_shard_buffer: 8388608
---
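To scan an existing OSD log for this symptom automatically, something like the following should work; the sample line is the one above, and in practice you would point grep at the real log file (the /var/log/ceph path is an assumption depending on your deployment):

```shell
# Sample debug line as shown above; in practice grep the real OSD log,
# e.g. /var/log/ceph/ceph-osd.<id>.log.
cat > osd.log <<'EOF'
2022-10-25T00:47:26.562+0000 7f424f78e700 30 bluestore.MempoolThread(0x564a9dae2a68) _resize_shards max_shard_onodes: 0 max_shard_buffer: 8388608
EOF

# A zero max_shard_onodes means the onode cache was squeezed to nothing;
# count matching lines.
grep -c 'max_shard_onodes: 0' osd.log
```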
This is apparently wrong, as it means the bluestore metadata cache is
basically disabled, but it goes a long way toward explaining why we hit
the race condition so easily: an onode will be trimmed right away after
it's unpinned.
Continuing the investigation, it turned out the culprit for the
0-sized cache is a leak in the bluestore_cache_other mempool.
Please refer to the bug tracker [3], which has the details of the leak
issue. It was already fixed by [4], and the next Pacific point
release will have it.
But it was never backported to Octopus.
So, if you are hitting the same issue:
If you are on Octopus, you can manually backport this patch to
fix the leak and prevent the race condition from happening.
If you are on Pacific, you can wait for the next Pacific point
release.
By the way, I'm backporting the fix to Ubuntu Octopus and Pacific
through this SRU [5], so it will land in Ubuntu's packages soon.
[1]https://tracker.ceph.com/issues/56382
[2]https://github.com/ceph/ceph/pull/47702
[3]https://tracker.ceph.com/issues/56424
[4]https://github.com/ceph/ceph/pull/46911
[5]https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1996010
Cheers,
Dongdong
--
Igor Fedotov
Ceph Lead Developer
--
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web <https://croit.io/> | LinkedIn <http://linkedin.com/company/croit> |
Youtube <https://www.youtube.com/channel/UCIJJSKVdcSLGLBtwSFx_epw> |
Twitter <https://twitter.com/croit_io>
Meet us at the SC22 Conference! Learn more <https://croit.io/croit-sc22>
Technology Fast50 Award Winner by Deloitte <https://www2.deloitte.com/de/de/pages/technology-media-and-telecommunications/articles/fast-50-2022-germany-winners.html>!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx