Folks,

I am running into a very strange issue with a brand-new Ceph cluster during initial testing. The cluster consists of 12 nodes: 4 of them have SSDs only, the other 8 have a mix of SSDs and HDDs. The latter nodes are configured so that three or four HDDs share one SSD for their block.db. The Ceph version is Nautilus.

When writing to the cluster, clients will, at regular intervals, run into I/O stalls (i.e. writes taking up to 25 minutes to complete). Deleting RBD images often takes forever as well.

After several weeks of debugging, what I can say from looking at the log files is that what appears to take a lot of time is getting the writes committed on the OSDs:

    {
        "time": "2020-05-20 10:52:23.211006",
        "event": "reached_pg"
    },
    {
        "time": "2020-05-20 10:52:23.211047",
        "event": "waiting for ondisk"
    },
    {
        "time": "2020-05-20 10:53:35.369081",
        "event": "done"
    }

(here, more than a minute passes between "waiting for ondisk" and "done"). But these machines are practically I/O idle; according to sysstat there is almost no I/O happening at all.

I am slowly growing a bit desperate over this, so I wonder whether anybody has ever seen a similar issue, or whether there are any tips on where to carry on with debugging.

The servers are Dell machines with PERC controllers in HBA mode. The primary purpose of this Ceph cluster is to serve as backing storage for OpenStack, and so far I have not been able to reproduce the issue on the SSD-only nodes.

Best regards
Martin
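P.S.: In case it helps anyone looking into something similar, event timings like the ones quoted above can be pulled from the OSD admin sockets. Below is a rough sketch of how one could scan a single OSD for ops with a large gap between two consecutive events; it assumes the usual Nautilus dump_historic_ops JSON layout (ops -> type_data -> events with "time"/"event" fields) and that "ceph daemon" can be run locally on the OSD host. Not my exact tooling, just one way to narrow down which OSDs the stalls correlate with.

    #!/usr/bin/env python3
    # Sketch: dump recent ops from one OSD's admin socket and report
    # ops where a lot of time passes between two consecutive events
    # (e.g. between "waiting for ondisk" and "done").
    # Assumptions: Nautilus dump_historic_ops output layout
    # (ops -> type_data -> events), run locally on the OSD host.
    import json
    import subprocess
    import sys
    from datetime import datetime

    OSD_ID = sys.argv[1] if len(sys.argv) > 1 else "0"
    THRESHOLD_S = 5.0  # only report gaps longer than this

    def parse_ts(ts):
        # timestamp format as seen in the op events, e.g. "2020-05-20 10:52:23.211006"
        return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")

    out = subprocess.check_output(
        ["ceph", "daemon", "osd." + OSD_ID, "dump_historic_ops"])
    for op in json.loads(out).get("ops", []):
        events = op.get("type_data", {}).get("events", [])
        for prev, cur in zip(events, events[1:]):
            gap = (parse_ts(cur["time"]) - parse_ts(prev["time"])).total_seconds()
            if gap > THRESHOLD_S:
                print("osd.%s: %.1fs between '%s' and '%s'  (%s)"
                      % (OSD_ID, gap, prev["event"], cur["event"],
                         op.get("description", "?")[:60]))

Run it against each HDD-backed OSD in turn (or loop over them) to see whether the long "waiting for ondisk" gaps cluster on specific OSDs, hosts, or block.db SSDs.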