Hi,

For the record, in the past we faced a similar issue, with OSDs being
killed one after another every day, starting at midnight. The root cause
was the device_health_check launched by the mgr on each OSD: while an OSD
is running device_health_check, its admin socket is busy and cannot answer
other commands (in particular the ones sent by the liveness probe). The
default liveness probe timeout set by Rook is probably too short relative
to the duration of device_health_check. In our case, we disabled
device_health_check on the mgr side; rough sketches of that workaround,
and of the probe-tuning alternative, are below the quoted thread.

Rgds,
Peter

On Thu, Sep 21, 2023 at 9:35 PM Sudhin Bengeri <sbengeri@xxxxxxxxx> wrote:

> Igor, Travis,
>
> Thanks for your attention to this issue.
>
> We extended the timeout for the liveness probe yesterday, and also
> extended the time after which a down OSD deployment is deleted by the
> operator. Once all the OSD deployments were recreated by the operator,
> we observed two OSD restarts, which is a much lower rate than before.
>
> Igor, we are still working on piecing together the logs (from our log
> store) from before the OSD restarts, and will send them shortly.
>
> Thanks,
> Sudhin
>
> On Thu, Sep 21, 2023 at 3:12 PM Travis Nielsen <tnielsen@xxxxxxxxxx>
> wrote:
>
> > If there is nothing obvious in the OSD logs, such as a failure to
> > start, and the OSDs appear to be running until the liveness probe
> > restarts them, you could disable the liveness probe or change its
> > timeouts. See
> > https://rook.io/docs/rook/latest/CRDs/Cluster/ceph-cluster-crd/#health-settings.
> >
> > But of course, we need to understand whether there is some issue with
> > the OSDs. Please open a Rook issue if it appears related to the
> > liveness probe.
> >
> > Travis
> >
> > On Thu, Sep 21, 2023 at 3:12 AM Igor Fedotov <igor.fedotov@xxxxxxxx>
> > wrote:
> >
> >> Hi!
> >>
> >> Can you share OSD logs demonstrating such a restart?
> >>
> >> Thanks,
> >> Igor
> >>
> >> On 20/09/2023 20:16, sbengeri@xxxxxxxxx wrote:
> >> > Since upgrading to 18.2.0, OSDs have been restarting very
> >> > frequently due to liveness probe failures, making the cluster
> >> > unusable. Has anyone else seen this behavior?
> >> >
> >> > Upgrade path: Ceph 17.2.6 to 18.2.0 (and Rook from 1.11.9 to
> >> > 1.12.1) on Ubuntu 20.04, kernel 5.15.0-79-generic.
> >> >
> >> > Thanks.
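
Disabling the mgr's device health checks is typically a matter of turning
off the devicehealth module's monitoring. The commands below are a minimal
sketch from memory; verify the exact option names against your Ceph
release before applying them:

    # Stop the mgr from scraping device health metrics from OSDs entirely
    ceph device monitoring off

    # Alternatively, keep monitoring but scrape less often; the default of
    # 86400 seconds (24 h) matches the nightly restarts we observed
    ceph config set mgr mgr/devicehealth/scrape_frequency 604800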
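
If you would rather follow Travis's suggestion and relax the probe instead,
the Rook CephCluster CRD lets you override the OSD liveness probe's timing
(see the health-settings doc linked above). The fragment below is an
illustrative sketch of the spec.healthCheck section; the values are
examples, not recommendations:

    healthCheck:
      livenessProbe:
        osd:
          disabled: false
          probe:
            # give a busy OSD admin socket more time to answer
            timeoutSeconds: 10
            # tolerate more consecutive failures before a restart
            failureThreshold: 10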