octopus rbd cluster just stopped out of nowhere (>20k slow ops)


 



hi,
maybe someone here can help me debug an issue we hit today.

Today one of our clusters came to a grinding halt, with 2/3 of our OSDs
reporting slow ops.
The only way to get it back to work quickly was to restart all OSD daemons.

The cluster is an Octopus cluster with 150 enterprise-SSD OSDs. The last
work on the cluster was syncing in a new node four days ago.

The only health issue reported was SLOW_OPS. No slow pings on the
networks, no restarting OSDs, nothing.

I was able to pin it down to a 20-second timeframe, and I read ALL the
logs in a 20-minute window around the incident.

I haven't found any clues.
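In case anyone wants to poke at a similar slow-ops situation, this is
roughly the kind of triage I mean, as a sketch: it assumes the ceph CLI
and local admin-socket access on the OSD host, and osd.0 is only a
placeholder id, not one of our actual OSDs.

```shell
# Placeholder OSD id; substitute an OSD that is reporting slow ops.
OSD_ID=0

if command -v ceph >/dev/null 2>&1; then
    # Which daemons the cluster currently blames for slow ops
    ceph health detail

    # Ops currently stuck on this OSD (run on the host carrying it)
    ceph daemon "osd.${OSD_ID}" dump_ops_in_flight

    # Recently completed slow ops, with per-phase timestamps
    ceph daemon "osd.${OSD_ID}" dump_historic_slow_ops
else
    echo "ceph CLI not available on this host"
fi
```

The historic dump is the most useful of the three here, since it keeps
the per-phase timings even after the ops have completed.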

Has anyone encountered this in the past?

-- 
The "UTF-8 problems" self-help group will, as an exception, meet in the
large hall this time.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



