Seriously degraded performance after update to Octopus

Martin Rasmus Lundquist Hansen <hansen@xxxxxxxxxxxx> · Mon, 2 Nov 2020 06:53:12 +0000

Two weeks ago we updated our Ceph cluster from Nautilus (14.2.0) to Octopus (15.2.5), an update that was long overdue. We used the Ansible playbooks to perform a rolling update, and apart from a few minor problems with the Ansible code, the update went well. The same playbooks were used to set up the cluster in the first place. Before updating the Ceph software we also performed a full update of CentOS and the Linux kernel (this part of the update had already been tested on one of the OSD nodes the week before, and we didn't notice any problems).

However, since the update we have seen a serious decrease in performance, by more than a factor of 10 in some cases. I spent a week trying to come up with an explanation or solution, but I have drawn a complete blank. Independently of Ceph, I tested the network performance and the performance of the OSD disks, and I did not see any problems there.

The specifications of the cluster are:
- 3x Monitor nodes running mgr+mon+mds (Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz, 16 cores, 196 GB RAM)
- 14x OSD nodes, each with 18 HDDs and 1 NVMe (Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz, 24 cores, 384 GB RAM)
- CentOS 7.8 and Kernel 5.4.51
- 100 Gbps InfiniBand

We are collecting various metrics using Prometheus, and on the OSD nodes we are seeing some clear differences in CPU and memory usage. I collected some graphs here: http://mitsted.dk/ceph . After the update the system load is much lower, there is almost no iowait on the CPUs any more, and the free memory is no longer used for buffers (I can confirm that the changes in these metrics are not due to the update of CentOS or the Linux kernel). All in all, the OSD nodes are now almost completely idle all the time (and so are the monitors). On the linked page I also attached two RADOS benchmarks. The first was performed when the cluster was initially configured, and the second is the same benchmark after the update to Octopus. Comparing the two, it is clear that the performance has changed dramatically. For example, in the write test the bandwidth dropped from 320 MB/s to 21 MB/s, and the number of IOPS has also dropped significantly.
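To put a number on the write regression, here is a minimal Python sketch that pulls the bandwidth figure out of two benchmark summaries and computes the ratio. It assumes the summary contains a line labelled "Bandwidth (MB/sec):". The two summary strings are hypothetical excerpts that hold only the two bandwidth values quoted above, and the helper names are made up for illustration.

```python
import re

# Assumption: the benchmark summary has a line starting with this label.
BANDWIDTH_RE = re.compile(r"^Bandwidth \(MB/sec\):\s*([0-9.]+)", re.MULTILINE)


def bandwidth_mb_s(summary: str) -> float:
    """Return the bandwidth in MB/s found in a benchmark summary text."""
    match = BANDWIDTH_RE.search(summary)
    if match is None:
        raise ValueError("no bandwidth line found in benchmark summary")
    return float(match.group(1))


def slowdown_factor(before: float, after: float) -> float:
    """Return how many times lower the 'after' value is than 'before'."""
    if after <= 0:
        raise ValueError("'after' must be positive")
    return before / after


# Hypothetical excerpts holding only the write bandwidths quoted in the post.
nautilus_summary = "Bandwidth (MB/sec):     320\n"
octopus_summary = "Bandwidth (MB/sec):     21\n"

factor = slowdown_factor(
    bandwidth_mb_s(nautilus_summary),
    bandwidth_mb_s(octopus_summary),
)
print(f"write bandwidth dropped by a factor of {factor:.1f}")
```

With the two values from the write test this gives a factor of roughly 15, consistent with "more than a factor of 10". The same helper could be applied to the read results on the linked page, provided their summaries use the same label.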

I temporarily disabled the firewall and SELinux on all nodes to see if it made any difference, but it didn't appear to (I did not restart any services during this test; I am not sure whether that would be necessary).

Any suggestions for finding the root cause of this performance decrease would be greatly appreciated.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx