Re: OSD crash on Onode::put

Hi Dongdong.

> is simple and can be applied cleanly.

I understand this statement from a developer's perspective. Now try to explain to a user with a cephadm-deployed, containerized cluster how to build a container from source, how to point cephadm at that container, and what to do on the next upgrade. I think "simple" depends on context. Applying a patch to a production system is currently an expert operation, I'm afraid.

If you have instructions for building a ceph-container with the patch applied, I would be very interested. I was asking for a source container for exactly this reason. As far as I can tell from the conversation, this is quite a project in itself. The thread was "Re: Building ceph packages in containers? [was: Ceph debian/ubuntu packages build]", but I can't find it on the mailing list any more. There seems to be an archived version: https://www.spinics.net/lists/ceph-users/msg73231.html
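Just to illustrate what I think this would involve (every image name, registry and path below is a placeholder, and step 1 -- rebuilding the packages with the patch applied -- is exactly the part I have no instructions for), the rough shape seems to be:

# 1) Rebuild the ceph RPMs/DEBs for your release with the patch applied.
# 2) Layer the rebuilt packages on top of the matching official image
#    (assumes the patched packages carry a higher release number than
#    the ones already installed in the base image):
cat > Dockerfile <<'EOF'
# base image must match the exact release the cluster is running
FROM quay.io/ceph/ceph:v16.2.11
COPY patched-rpms/ /tmp/patched-rpms/
RUN dnf install -y /tmp/patched-rpms/*.rpm && dnf clean all
EOF
podman build -t registry.example.com/ceph/ceph:v16.2.11-onode-fix .
podman push registry.example.com/ceph/ceph:v16.2.11-onode-fix

# 3) Point cephadm at the custom image, e.g. via an upgrade run
ceph orch upgrade start --image registry.example.com/ceph/ceph:v16.2.11-onode-fix
# and/or make it the default for future (re)deployments
ceph config set global container_image registry.example.com/ceph/ceph:v16.2.11-onode-fix

And then repeat all of this for every point release until the fix lands in an official image, which is why I wouldn't call it simple for an operator.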

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Dongdong Tao <dongdong.tao@xxxxxxxxxxxxx>
Sent: 11 January 2023 04:30:14
To: Frank Schilder
Cc: Igor Fedotov; ceph-users@xxxxxxx; cobanserkan@xxxxxxxxx
Subject: Re: Re: OSD crash on Onode::put

Hi Frank,

I don't have an operational workaround, the patch https://github.com/ceph/ceph/pull/46911/commits/f43f596aac97200a70db7a70a230eb9343018159 is simple and can be applied cleanly.

Yes, restarting the OSD will clear the pool entries. You can restart it when the bluestore_onode item count is very low (e.g. fewer than 10) if that really helps, but I think you'll need to tune and monitor the performance until you find a number that is most suitable for your cluster.
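Something along these lines can be used to watch the count before restarting (the OSD id and the jq path are placeholders, and the mempool name/JSON layout may vary a bit between releases):

# items currently held in the onode cache mempool of one OSD
ceph daemon osd.12 dump_mempools | jq '.mempool.by_pool.bluestore_cache_onode'
# restart just that OSD once the count is down to a handful
systemctl restart ceph-osd@12        # non-containerized deployments
ceph orch daemon restart osd.12      # cephadm/containerized deployments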

But it can't help with the crash itself, since a crash is, in effect, just another restart.

Regards,
Dongdong

On Tue, Jan 10, 2023 at 8:21 PM Serkan Çoban <cobanserkan@xxxxxxxxx> wrote:
Is slot 19 inside the chassis? Have you checked the chassis temperature? I
sometimes see a higher failure rate on the HDDs inside the chassis than on
the ones at the front of the chassis. In our case it was related to the
temperature difference.
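If it helps, a quick way to compare drive and chassis temperatures (device names are placeholders, and the SMART attribute name differs between SATA and SAS drives):

# per-drive temperature as reported by SMART
for dev in /dev/sd{a..x}; do
    printf '%s: ' "$dev"
    smartctl -A "$dev" | grep -i temperature | head -n1
done
# chassis/ambient sensors via the BMC, if available
ipmitool sdr type Temperature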

On Tue, Jan 10, 2023 at 1:28 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Following up on my previous post, we have identical OSD hosts. The very strange observation now is that all outlier OSDs are in exactly the same disk slot on these hosts: we have 5 problematic OSDs and they are all in slot 19 on 5 different hosts. This is an extremely strange and unlikely coincidence.
>
> Are there any specific conditions for this problem to be present or amplified that could have to do with hardware?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



