Hi Frank,

I don't have an operational workaround; the patch
https://github.com/ceph/ceph/pull/46911/commits/f43f596aac97200a70db7a70a230eb9343018159
is simple and can be applied cleanly.

Yes, restarting the OSD will clear the pool entries. You can restart it when the
bluestore_onode item count is very low (e.g. fewer than 10) if that really helps,
but you will need to tune and monitor performance until you find a threshold that
suits your cluster (a rough sketch of such a monitoring loop is appended at the
end of this message). It won't help with the crash, though, since the crash
itself is essentially just a restart anyway.

Regards,
Dongdong

On Tue, Jan 10, 2023 at 8:21 PM Serkan Çoban <cobanserkan@xxxxxxxxx> wrote:
> Is slot 19 inside the chassis? Did you check the chassis temperature? I
> sometimes see a higher failure rate for the HDDs inside the chassis than
> for the ones at the front of the chassis. In our case it was related to
> the temperature difference.
>
> On Tue, Jan 10, 2023 at 1:28 PM Frank Schilder <frans@xxxxxx> wrote:
> >
> > Following up on my previous post: we have identical OSD hosts. The very
> > strange observation now is that all outlier OSDs are in exactly the same
> > disk slot on these hosts. We have 5 problematic OSDs, and they are all in
> > slot 19 on 5 different hosts. This is an extremely strange and unlikely
> > coincidence.
> >
> > Are there any specific conditions for this problem to be present or
> > amplified that could be related to hardware?
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
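
For reference, a minimal sketch (an illustration under stated assumptions, not a
tested tool) of the "restart when bluestore_onode items are low" approach
described above. It reads the onode item count from the OSD's local admin socket
with 'ceph daemon osd.<id> dump_mempools' and restarts the daemon via systemd.
The OSD id, threshold, poll interval, and the ceph-osd@<id> unit name are all
placeholders, and a package-based (non-cephadm) deployment is assumed:

#!/usr/bin/env python3
# Rough sketch only -- run on the OSD host itself (the admin socket is local).
# Polls one OSD's bluestore_onode mempool item count and restarts the daemon
# once the count drops below a threshold.
import json
import subprocess
import time

OSD_ID = 19        # placeholder OSD id -- pick the OSD you want to bounce
THRESHOLD = 10     # "very low" onode item count; tune for your cluster
POLL_SECONDS = 60  # how often to check

def onode_items(osd_id):
    # 'ceph daemon osd.<id> dump_mempools' prints mempool stats as JSON;
    # some releases wrap the output in a top-level "mempool" key.
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "dump_mempools"])
    data = json.loads(out)
    pools = data.get("mempool", data)["by_pool"]
    return pools["bluestore_onode"]["items"]

while True:
    if onode_items(OSD_ID) < THRESHOLD:
        # systemd unit name assumed for a non-cephadm deployment
        subprocess.run(["systemctl", "restart", f"ceph-osd@{OSD_ID}"],
                       check=True)
        break
    time.sleep(POLL_SECONDS)

Only restart one OSD at a time, and watch recovery and client performance while
tuning the threshold, as Dongdong suggests above.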