-- Resending this mail; it seems ceph-users@xxxxxxx was down for the last few days. --

I have seen many users recently reporting that they are struggling with the Onode::put race condition issue [1] on both the latest Octopus and Pacific. Igor opened a PR [2] to address it; I have been reviewing it for a while and it looks good to me. I hope it can get some priority from the community.

For those who have been hitting this issue, I would like to share a workaround that will very likely unblock you.

While investigating this issue, I found that the race condition always happens after the BlueStore onode cache size becomes 0. Setting debug_bluestore = 1/30 lets you see the cache size in the OSD log after the crash (the P.S. at the end sketches the commands I used):

---
2022-10-25T00:47:26.562+0000 7f424f78e700 30 bluestore.MempoolThread(0x564a9dae2a68) _resize_shards max_shard_onodes: 0 max_shard_buffer: 8388608
---

A max_shard_onodes value of 0 is clearly wrong, since it means the BlueStore metadata cache is effectively disabled, but it explains why the race condition is hit so easily: an onode is trimmed right away after it is unpinned.

Digging further, it turns out the culprit for the zero-sized cache is a leak in the bluestore_cache_other mempool. Please refer to the bug tracker [3] for the details of the leak. It was already fixed by [4], and the next Pacific point release will include the fix, but it was never backported to Octopus. So if you are hitting the same issue:

For Octopus, you can manually backport the patch to fix the leak and prevent the race condition from happening.
For Pacific, you can wait for 16.2.11 (or manually backport the fix as well if you can't wait).

By the way, I am backporting the fix to the Ubuntu Octopus and Pacific packages through this SRU [5], so it will land in Ubuntu's packages soon.

[1] https://tracker.ceph.com/issues/56382
[2] https://github.com/ceph/ceph/pull/47702
[3] https://tracker.ceph.com/issues/56424
[4] https://github.com/ceph/ceph/pull/46911
[5] https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1996010

Cheers,
Dongdong
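
P.S. In case it is useful, here is roughly how I checked this on a live cluster. This is only a sketch: "osd.0" and the log path are placeholders for your own OSD id and log location, and the exact mempool names in dump_mempools output vary a bit between releases.

---
# Raise the BlueStore debug level so the _resize_shards lines appear in the OSD log
ceph config set osd debug_bluestore 1/30

# After the next cache trim (or the next crash), look at the shard sizes the
# MempoolThread computed; max_shard_onodes: 0 indicates the problem described above
grep _resize_shards /var/log/ceph/ceph-osd.0.log | tail

# Check whether the bluestore_cache_other mempool keeps growing over time,
# which would point to the leak tracked in [3]
ceph daemon osd.0 dump_mempools
---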