Re: octopus rbd cluster just stopped out of nowhere (>20k slow ops)

Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> · Sat, 3 Dec 2022 21:15:07 -0500

Boris, I have seen a single problematic OSD cause this issue on every OSD
with which its PGs peered.  The solution was to take the slow OSD out;
all slow ops stopped immediately.  I found it by looking for the OSDs that
were common to the reported slow ops.  I am not saying this is your issue,
but it is a possibility.  Good luck!
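
The "common OSDs" check can be done by tallying which peer OSDs the slow
ops are waiting on. Below is a minimal sketch, not a definitive tool. It
assumes the slow-op lines you collected contain a fragment like
"waiting for subops from 17,42"; that pattern, the function name
count_blocking_osds, and the sample lines are all illustrative
assumptions, so adjust the regex to whatever your logs or op dumps
actually show.

```python
import re
from collections import Counter

# Assumed fragment; adjust the pattern to match your real log or dump output.
SUBOPS_RE = re.compile(r"waiting for subops from ([\d,\s]+)")


def count_blocking_osds(lines):
    """Count how often each OSD id appears as a peer that slow ops wait on."""
    counts = Counter()
    for line in lines:
        match = SUBOPS_RE.search(line)
        if not match:
            continue  # line does not name any peer OSDs
        for osd_id in match.group(1).split(","):
            osd_id = osd_id.strip()
            if osd_id:
                counts[int(osd_id)] += 1
    return counts


# Made-up sample lines for illustration only (hypothetical format).
sample = [
    "osd.3 slow request: currently waiting for subops from 17,42",
    "osd.8 slow request: currently waiting for subops from 17,5",
    "osd.21 slow request: currently waiting for subops from 9,17",
    "osd.5 slow request: currently delayed",
]

# Print the peers named most often; one that dominates is the suspect.
for osd_id, hits in count_blocking_osds(sample).most_common(3):
    print(f"osd.{osd_id}: named in {hits} slow ops")
```

If one OSD id clearly dominates the tally, that is the candidate to take
out and watch whether the slow ops clear.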

--
Alex Gorbachev
https://alextelescope.blogspot.com

On Fri, Dec 2, 2022 at 7:54 PM Boris Behrens <bb@xxxxxxxxx> wrote:

> hi,
> maybe someone here can help me to debug an issue we faced today.
>
> Today one of our clusters came to a grinding halt with 2/3 of our OSDs
> reporting slow ops.
> The only option to get it working again quickly was to restart all OSD daemons.
>
> The cluster is an octopus cluster with 150 enterprise SSD OSDs. Last work
> on the cluster: synced in a node 4 days ago.
>
> The only health issue reported was SLOW_OPS. No slow pings on the
> networks. No restarting OSDs. Nothing.
>
> I was able to pin it down to a 20s timeframe and I read ALL the logs in a
> 20 minute timeframe around this issue.
>
> I haven't found any clues.
>
> Maybe someone encountered this in the past?
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groÃƒ¼en Saal.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx