Hi Igor,
Thanks for your response.
I realize you already responded to that for my colleague Wissem here:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/WV3GIKRZIEWHULVTR5THNDGX3WD4T2FN/
It was really helpful to have a workaround.
As v15.2.16 has just been released, I was wondering whether progress had been
made, but as you confirm, there are not many reported issues, let alone
reproducible ones...
For our context:
We use EC 3+2 for OpenStack RBD, except for one Cinder volume type that is
replicated only.
We have WAL+DB on SSD, data on HDD.
We have 5 OSD servers with 8 HDDs (SAS) + 2 SSDs, and 3 servers shared with
the OpenStack controllers for MON, MGR and RGW.
We don't use snapshots as a routine task.
We use versioning on object storage, accessed from applications over the S3
protocol.
Last crash / repair: 01/03/2022. No logs mentioning zombies since.
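To make the layout more concrete, here is roughly how such a setup is created
(a sketch only; the profile/pool names, device paths and PG counts below are
illustrative, not our exact values):

  # EC 3+2 data pool for RBD (image metadata lives in a replicated pool)
  ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
  ceph osd pool create rbd-ec-data 128 128 erasure ec-3-2
  ceph osd pool set rbd-ec-data allow_ec_overwrites true
  ceph osd pool application enable rbd-ec-data rbd
  # OSDs: data on HDD, WAL+DB on an SSD logical volume
  ceph-volume lvm create --bluestore --data /dev/sdX --block.db ssd-vg/db-sdX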
On 2022-03-03 12:30, Igor Fedotov wrote:
Hi Gilles,
the PR you mentioned is present in Octopus, but it looks like it's
ineffective/inappropriate against the issue.
Hence I think there is not much sense in upgrading to Pacific if that's
the only reason for you...
Actually, I've been trying to catch the bug for a long time, but without
success so far. Not every cluster is affected by the issue; apparently
some tricky(?) use pattern triggers it.
Curious how long it takes for your OSDs to hit the assertion after the
repair?
And could you please share any additional details about the cluster's
usage? So the major use case is RBD, right? Replicated or EC pools? How
often are snapshots taken, if any?
Thanks,
Igor
On 3/3/2022 1:45 PM, Gilles Mocellin wrote:
Hello !
On our Octopus (v15.2.15) cluster, mainly used for OpenStack, we had
several OSD crashes.
Some would not restart, failing on a "no available blob id" assertion.
We found several related bugs:
https://tracker.ceph.com/issues/48216
https://tracker.ceph.com/issues/38272
The workaround that works is to fsck / repair the stopped OSD:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<osd_id> --command repair
But it's not a long-term solution.
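To be explicit, the full sequence we run is roughly the following (assuming
non-containerized, systemd-managed OSDs; adjust the unit name to your
deployment):

  systemctl stop ceph-osd@<osd_id>
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<osd_id> --command fsck
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<osd_id> --command repair
  systemctl start ceph-osd@<osd_id>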
I have seen a PR merged in 2019 here:
https://github.com/ceph/ceph/pull/28229
But I can't find whether it made it into Octopus, or whether it resolves
the problem completely.
I also wonder whether someone has had this problem with Pacific, which
could motivate us to upgrade from Octopus.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx