Dear Cephers,
I asked this question on the users list, where similar problems have
been reported, but I have not received a reply yet. I have a Ceph
cluster with 16 nodes and 335 OSDs, all running Octopus 15.2.13 now.
While upgrading from Nautilus last week, a problem was triggered by a
specific node (Ceph09): several slow requests appeared right after
Ceph09 was upgraded. The first 8 nodes have identical hardware and
completed the upgrade without problems. Once the problem started,
client IO nearly stopped, and rebooting the Ceph09 node did not help.
The only thing that helped was rebooting the monitor nodes one by one
to get rid of the slow requests.
We have seen lots of "fault initiating reconnect" messages like the
one below in the OSD logs on the Ceph09 node.
2021-07-10T13:11:12.333+0300 7f90a5168700 0 --1- [v2:10.30.3.139:6800/90655,v1:10.30.3.139:6801/90655] >> v1:10.30.3.132:6833/286861 conn(0x561748e62c00 0x561768cc6800 :-1 s=OPENED pgs=2366 cs=245 l=0).fault initiating reconnect
We have since completed the upgrade on the remaining Ceph nodes
without problems, and all nodes are running Octopus 15.2.13 now. But
whenever we restart the OSDs on Ceph09 or reboot that node, the same
problem occurs immediately. The same operations on the other nodes,
including rebooting a node or restarting its OSDs, do not trigger the
problem. Interestingly, once the problem starts we also see the "fault
initiating reconnect" messages for OSDs running on other nodes.
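If more detail from the messenger layer would help, we can raise the
debug level on one of the affected OSDs and capture a full reconnect
cycle; a sketch of what we would run (the OSD id is only an example)
is:

    # raise messenger logging for one affected OSD (osd.12 is illustrative)
    ceph config set osd.12 debug_ms 5
    # ... reproduce the problem and collect the OSD log ...
    # then remove the override to return to the default level
    ceph config rm osd.12 debug_ms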
To investigate, we tried reweighting all OSDs on Ceph09 to 0. The
same problem occurred again: slow requests and degraded client IO
performance. Restarting all Ceph daemons on the monitor nodes did not
help, and we rebooted the monitor nodes one by one several times
without luck. Finally, rebooting cephmonitor01 twice fixed the issue.
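For reference, the reweighting above was done with commands along
these lines. This is only a sketch; it assumes the host bucket for the
node is named "Ceph09" in the CRUSH map:

    # list the OSD ids under the Ceph09 host bucket and set their
    # reweight to 0 so PGs drain off the node
    for id in $(ceph osd ls-tree Ceph09); do
        ceph osd reweight "$id" 0
    done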
We have checked all the network settings, including MTUs, and
everything seems fine. iperf3 tests between any pair of nodes in the
cluster give the expected results, and dmesg and syslog do not contain
any critical messages about the disks.
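To give an idea of the network checks, the iperf3 tests were roughly
of this form (the exact parameters varied; the address is just one of
the OSD node IPs from the log above):

    # on one node
    iperf3 -s
    # from a peer node: 4 parallel streams for 30 seconds
    iperf3 -c 10.30.3.139 -P 4 -t 30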
At the moment, any operation on Ceph09 triggers the problem and we
have not been able to find a fix.
Does anyone have any idea what might cause this, or any advice on how
to trace the problem further?
Any advice and suggestions would be greatly appreciated.
Best regards,
HC