Folks,

I am running into a very strange issue with a brand-new Ceph cluster during initial testing. The cluster consists of 12 nodes: 4 of them have SSDs only, the other 8 have a mix of SSDs and HDDs. The latter nodes are configured so that three or four HDDs share one SSD for their block.db. The Ceph version is Nautilus.

When writing to the cluster, clients will, at regular intervals, run into I/O stalls (i.e. writes taking up to 25 minutes to complete). Deleting RBD images often takes forever as well.

After several weeks of debugging, what I can say from looking at the log files is that what appears to take a lot of time is getting the writes committed on the OSDs:

    {
        "time": "2020-05-20 10:52:23.211006",
        "event": "reached_pg"
    },
    {
        "time": "2020-05-20 10:52:23.211047",
        "event": "waiting for ondisk"
    },
    {
        "time": "2020-05-20 10:53:35.369081",
        "event": "done"
    }

(here, more than a minute passes between "waiting for ondisk" and "done"). But these machines are practically I/O idle; according to sysstat there is almost no I/O happening at all.

I am slowly growing a bit desperate over this, so I wonder whether anybody has ever seen a similar issue, or whether there are any tips on where to carry on with debugging.

The servers are Dell machines with PERC controllers in HBA mode. The primary purpose of this Ceph cluster is to serve as backing storage for OpenStack, and so far I have not been able to reproduce the issue on the SSD-only nodes.

Best regards
Martin
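P.S.: In case it helps anyone looking into something similar, event timings like the ones quoted above can be pulled from the OSD admin sockets. Below is a rough sketch of how one could scan a single OSD for ops with a large gap between two consecutive events; it assumes the usual Nautilus dump_historic_ops JSON layout (ops -> type_data -> events with "time"/"event" fields) and that "ceph daemon" can be run locally on the OSD host. Not my exact tooling, just one way to narrow down which OSDs the stalls correlate with.

    #!/usr/bin/env python3
    # Sketch: dump recent ops from one OSD's admin socket and report
    # ops where a lot of time passes between two consecutive events
    # (e.g. between "waiting for ondisk" and "done").
    # Assumptions: Nautilus dump_historic_ops output layout
    # (ops -> type_data -> events), run locally on the OSD host.
    import json
    import subprocess
    import sys
    from datetime import datetime

    OSD_ID = sys.argv[1] if len(sys.argv) > 1 else "0"
    THRESHOLD_S = 5.0  # only report gaps longer than this

    def parse_ts(ts):
        # timestamp format as seen in the op events, e.g. "2020-05-20 10:52:23.211006"
        return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")

    out = subprocess.check_output(
        ["ceph", "daemon", "osd." + OSD_ID, "dump_historic_ops"])
    for op in json.loads(out).get("ops", []):
        events = op.get("type_data", {}).get("events", [])
        for prev, cur in zip(events, events[1:]):
            gap = (parse_ts(cur["time"]) - parse_ts(prev["time"])).total_seconds()
            if gap > THRESHOLD_S:
                print("osd.%s: %.1fs between '%s' and '%s'  (%s)"
                      % (OSD_ID, gap, prev["event"], cur["event"],
                         op.get("description", "?")[:60]))

Run it against each HDD-backed OSD in turn (or loop over them) to see whether the long "waiting for ondisk" gaps cluster on specific OSDs, hosts, or block.db SSDs.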