Boris, I have seen one problematic OSD cause this issue on all OSDs with
which its PGs peered. The solution was to take out the slow OSD; immediately
all slow ops stopped. I found it by looking for the OSDs common to the
reported slow ops. Not saying this is your issue, but it may be a
possibility. Good luck!

--
Alex Gorbachev
https://alextelescope.blogspot.com

On Fri, Dec 2, 2022 at 7:54 PM Boris Behrens <bb@xxxxxxxxx> wrote:

> Hi,
> maybe someone here can help me debug an issue we faced today.
>
> Today one of our clusters came to a grinding halt, with 2/3 of our OSDs
> reporting slow ops. The only option to get it back to work fast was to
> restart all OSD daemons.
>
> The cluster is an Octopus cluster with 150 enterprise SSD OSDs. Last work
> on the cluster: synced in a node 4 days ago.
>
> The only health issue reported was SLOW_OPS. No slow pings on the
> networks. No restarting OSDs. Nothing.
>
> I was able to pin it down to a 20s timeframe, and I read ALL the logs in
> a 20 minute window around the issue.
>
> I haven't found any clues.
>
> Maybe someone encountered this in the past?
>
> --
> Die Selbsthilfegruppe "UTF-8-Probleme" trifft sich diesmal abweichend im
> groüen Saal.
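
A minimal sketch of the approach Alex describes, tallying which OSDs show up
most often across the reported slow ops. It assumes plain `ceph health detail`
output, where SLOW_OPS lines name individual OSDs; the script itself and its
parsing heuristic are illustrative and not from the thread, and the exact
wording of the health lines can vary between releases.

#!/usr/bin/env python3
"""Count how often each OSD is named in SLOW_OPS health output.

Rough sketch: shell out to `ceph health detail` and tally osd.N mentions,
on the assumption that the OSD implicated most often (or peering with the
rest) deserves a closer look.
"""
import re
import subprocess
from collections import Counter


def slow_op_osds() -> Counter:
    # `ceph health detail` prints SLOW_OPS entries that mention OSD ids;
    # the exact phrasing differs between releases, so match loosely.
    out = subprocess.run(
        ["ceph", "health", "detail"],
        capture_output=True, text=True, check=True,
    ).stdout

    counts = Counter()
    for line in out.splitlines():
        if "slow ops" not in line.lower():
            continue
        counts.update(re.findall(r"osd\.\d+", line))
    return counts


if __name__ == "__main__":
    for osd, n in slow_op_osds().most_common():
        print(f"{osd}: mentioned in {n} slow-ops line(s)")

Run on a mon or admin node with the client keyring available; an OSD that
keeps appearing across otherwise unrelated slow-ops reports is a candidate
for `ceph osd out` as in Alex's suggestion.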