Hi,

Today our entire cluster froze, or to be specific, anything that uses librbd. This is ceph version 16.2.10.

The message that saved me was "256 slow ops, oldest one blocked for 2893 sec, osd.7 has slow ops", because it made it immediately clear that this OSD was the issue. I stopped the OSD, which made the cluster available again. Restarting the OSD makes it stuck again, even though that OSD has nothing in its error log and the underlying SSD is healthy.

It's just that one OSD out of 27, and there is nothing unique about it: we use the same disk product in other OSDs, and the host is also running other OSDs just fine.

How does this happen, and why can the cluster not recover from this automatically, for example by stopping the affected OSD, or at least by having a timeout for ops?

Thanks

--
+4916093821054
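P.S. In case it helps anyone hitting the same symptom, a rough sketch of the triage steps, assuming a systemd-managed OSD (the exact stop command will differ under cephadm/rook), with osd.7 standing in for whichever daemon the health warning names:

    # show which daemon is reporting the slow ops (this surfaced the warning quoted above)
    ceph health detail

    # on the OSD's host, inspect the stuck operations via the admin socket
    ceph daemon osd.7 dump_ops_in_flight

    # optionally keep the cluster from rebalancing while investigating
    ceph osd set noout

    # stop the affected OSD daemon (systemd deployment assumed)
    systemctl stop ceph-osd@7

With the daemon stopped, client I/O resumed once the remaining OSDs took over the affected PGs.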