Hi,

Today our entire cluster froze, or to be specific, anything that uses librbd. This is ceph version 16.2.10.

The message that saved me was "256 slow ops, oldest one blocked for 2893 sec, osd.7 has slow ops", because it made it immediately clear that this OSD was the issue. I stopped the OSD, which made the cluster available again. Restarting the OSD makes it stuck again, even though that OSD has nothing in its error log and the underlying SSD is healthy.

It's just that one OSD out of 27, and there is nothing unique about it: we use the same disk product in other OSDs, and the host is also running other OSDs just fine.

How does this happen, and why can the cluster not recover from this automatically, for example by stopping the affected OSD, or at least by having a timeout for ops?

Thanks

--
+4916093821054
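P.S. In case it helps anyone hitting the same symptom, a rough sketch of the triage steps, assuming a systemd-managed OSD (the exact stop command will differ under cephadm/rook), with osd.7 standing in for whichever daemon the health warning names:

    # show which daemon is reporting the slow ops (this surfaced the warning quoted above)
    ceph health detail

    # on the OSD's host, inspect the stuck operations via the admin socket
    ceph daemon osd.7 dump_ops_in_flight

    # optionally keep the cluster from rebalancing while investigating
    ceph osd set noout

    # stop the affected OSD daemon (systemd deployment assumed)
    systemctl stop ceph-osd@7

With the daemon stopped, client I/O resumed once the remaining OSDs took over the affected PGs.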