Re: OSD crash with "no available blob id" / Zombie blobs


 



Hi Igor,

Thanks for your response.
I realize you already responded to my colleague Wissem here:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/WV3GIKRZIEWHULVTR5THNDGX3WD4T2FN/

It was really helpful to have a workaround.

As v15.2.16 has just been released, I wonder whether progress has been made, but as you confirm, there are not many reported issues, and fewer still that are reproducible...

For our context:

We use EC 3+2 for OpenStack RBD, except for one Cinder volume type that is replicated only.
We have WAL+DB on SSD, data on HDD.
We have 5 OSD servers, each with 8 HDDs (SAS) + 2 SSDs, and 3 servers shared with the OpenStack controllers for MON, MGR and RGW.

We don't use snapshots as routine tasks.

We use Versioning on Object Storage, accessed from apps with S3 protocol.

Last crash/repair: 01/03/2022. No logs mentioning zombies since.


On 2022-03-03 12:30, Igor Fedotov wrote:
Hi Gilles,

the PR you mentioned is present in Octopus, but it looks like it's
ineffective/inappropriate against this issue.

Hence I think there is not much sense in upgrading to Pacific if that's
the only reason for you...


Actually, I've been trying to catch the bug for a long time, but without
success so far. Not every cluster is affected by the issue; apparently
some tricky(?) usage pattern triggers it.

I'm curious: how long does it take for your OSDs to hit the assertion
again after the repair?

And could you please share any additional details about the cluster's
usage? So the major use case is RBD, right? Replicated or EC pools? How
often are snapshots taken, if any?


Thanks,

Igor

On 3/3/2022 1:45 PM, Gilles Mocellin wrote:
Hello !

On our Octopus (v15.2.15) cluster, mainly used for OpenStack,
we had several OSD crashes.
Some OSDs would not restart, failing with the "no available blob id" assertion.

We found several related bugs:
https://tracker.ceph.com/issues/48216
https://tracker.ceph.com/issues/38272

The workaround that works is to run fsck/repair on the stopped OSD:
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-<osd_id> --command repair
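For reference, here is a sketch of how the workaround can be applied to a single OSD. The OSD id (12) is a placeholder, the systemd unit name assumes a package-based (non-containerized) deployment, and the read-only fsck pass before the repair is our own precaution, not something required by the workaround:

```shell
#!/bin/sh
# Sketch: apply the fsck/repair workaround to one stopped OSD.
# OSD_ID is a placeholder; adjust to the crashed OSD.
OSD_ID=12
OSD_PATH="/var/lib/ceph/osd/ceph-${OSD_ID}"

# The OSD must be stopped before ceph-bluestore-tool can open its store.
systemctl stop "ceph-osd@${OSD_ID}"

# Optional read-only check first, to see what fsck reports.
ceph-bluestore-tool --path "${OSD_PATH}" --command fsck

# The actual repair, as in the workaround above.
ceph-bluestore-tool --path "${OSD_PATH}" --command repair

systemctl start "ceph-osd@${OSD_ID}"
```

This only clears the current inconsistency; as noted below, the assertion can come back later, so the OSD logs still need watching afterwards.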

But it's not a long term solution.

I have seen a PR merged in 2019 here :
https://github.com/ceph/ceph/pull/28229

But I can't find whether it made it into Octopus, or whether it completely resolves the problem.

I also wonder if anyone has had this problem with Pacific, which could motivate us to upgrade from Octopus.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



