Re: Seriously degraded performance after update to Octopus

"Marc Roos" <M.Roos@xxxxxxxxxxxxxxxxx> · Mon, 2 Nov 2020 09:54:29 +0100

I have long advocated publishing test results from a basic test cluster 
across different Ceph releases: a simple cluster that covers the most 
common configurations and runs the same tests each time, so that Ceph 
performance alone can be compared between releases. That would mean a lot 
to smaller companies that do not have access to a good test environment. 
I have also raised this at a Ceph seminar.

-----Original Message-----
From: Martin Rasmus Lundquist Hansen [mailto:hansen@xxxxxxxxxxxx] 
Sent: Monday, November 02, 2020 7:53 AM
To: ceph-users@xxxxxxx
Subject: Seriously degraded performance after update to Octopus

Two weeks ago we updated our Ceph cluster from Nautilus (14.2.0) to 
Octopus (15.2.5), an update that was long overdue. We used the Ansible 
playbooks to perform a rolling update and, except for a few minor 
problems with the Ansible code, the update went well. The Ansible 
playbooks were also used for setting up the cluster in the first place. 
Before updating the Ceph software we also performed a full update of 
CentOS and the Linux kernel (this part of the update had already been 
tested on one of the OSD nodes the week before and we didn't notice any 
problems).

However, after the update we are seeing a serious decrease in 
performance, by more than a factor of 10 in some cases. I spent a week 
trying to come up with an explanation or solution, but I am at a 
complete loss. Independently of Ceph I tested the network performance 
and the performance of the OSD disks, and I am not really seeing any 
problems there.

The specifications of the cluster are:
- 3x Monitor nodes running mgr+mon+mds (Intel(R) Xeon(R) Silver 4108 CPU 
@ 1.80GHz, 16 cores, 196 GB RAM)
- 14x OSD nodes, each with 18 HDDs and 1 NVMe (Intel(R) Xeon(R) Gold 
6126 CPU @ 2.60GHz, 24 cores, 384 GB RAM)
- CentOS 7.8 and Kernel 5.4.51
- 100 Gbps InfiniBand
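
For a rough sense of scale, the OSD node figures above can be totalled in a few lines of Python. This is only an illustrative sketch: it assumes one OSD per HDD, which the list does not state, and the variable names are made up for this example.

```python
# Totals derived from the hardware list above.
# Assumption (not stated in the post): one OSD per HDD.
osd_nodes = 14
hdds_per_node = 18
nvme_per_node = 1
ram_gb_per_node = 384

total_hdds = osd_nodes * hdds_per_node            # HDDs across all OSD nodes
total_nvme = osd_nodes * nvme_per_node            # NVMe devices across all OSD nodes
ram_gb_per_hdd = ram_gb_per_node / hdds_per_node  # RAM available per HDD on a node

print(f"HDDs: {total_hdds}, NVMe devices: {total_nvme}")
print(f"RAM per HDD on an OSD node: {ram_gb_per_hdd:.1f} GB")
```

With these figures the cluster has 252 HDDs and 14 NVMe devices, with roughly 21 GB of RAM per HDD on each OSD node.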

We are collecting various metrics using Prometheus, and on the OSD nodes 
we are seeing some clear differences in CPU and memory usage. I collected 
some graphs here: http://mitsted.dk/ceph . After the update the system 
load is much lower, there is almost no iowait on the CPUs any more, and 
the free memory is no longer used for buffers (I can confirm that the 
changes in these metrics are not due to the update of CentOS or the Linux 
kernel). All in all, the OSD nodes are now almost completely idle all the 
time (and so are the monitors). On the linked 
page I also attached two RADOS benchmarks. The first benchmark was 
performed when the cluster was initially configured, and the second is 
the same benchmark after the update to Octopus. When comparing these 
two, it is clear that the performance has changed dramatically. For 
example, in the write test the bandwidth is reduced from 320 MB/s to 21 
MB/s and the number of IOPS has also dropped significantly.
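
To put the write figures quoted above in proportion, here is a minimal Python sketch of the arithmetic. The function names are hypothetical and only the two bandwidth values (320 MB/s and 21 MB/s) come from the post.

```python
def slowdown_factor(before_mb_s: float, after_mb_s: float) -> float:
    """Return how many times slower the 'after' result is than 'before'."""
    if after_mb_s <= 0:
        raise ValueError("after_mb_s must be positive")
    return before_mb_s / after_mb_s


def percent_drop(before_mb_s: float, after_mb_s: float) -> float:
    """Return the relative decrease from 'before' to 'after' in percent."""
    if before_mb_s <= 0:
        raise ValueError("before_mb_s must be positive")
    return 100.0 * (before_mb_s - after_mb_s) / before_mb_s


# Write bandwidth from the two RADOS benchmarks mentioned in the post.
write_factor = slowdown_factor(320.0, 21.0)
write_drop = percent_drop(320.0, 21.0)
print(f"write bandwidth: {write_factor:.1f}x slower ({write_drop:.1f}% drop)")
# prints "write bandwidth: 15.2x slower (93.4% drop)"
```

That is consistent with the "more than a factor of 10" described earlier in the message.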

I temporarily disabled the firewall and SELinux on all nodes to see if 
it made any difference, but it did not appear to (I did not restart any 
services during this test; I am not sure whether that would be 
necessary).

Any suggestions for finding the root cause of this performance decrease 
would be greatly appreciated.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
