Slow requests triggered by a single node during upgrade

Dear Cephers,

I have already asked this question on the users list, where similar problems have been reported, but I have not received a reply yet.

I have a Ceph cluster with 16 nodes and 335 OSDs, all running Octopus 15.2.13 now. While upgrading from Nautilus last week, a problem was triggered by a specific node (Ceph09): several slow requests appeared after Ceph09 was upgraded. The first 8 nodes have identical hardware and completed the upgrade without problems. Once the problem started, client IO nearly stopped, and rebooting Ceph09 did not help. The only thing that got rid of the slow requests was rebooting the monitor nodes one by one.

We have seen lots of "fault initiating reconnect" messages like the one below in the OSD logs on Ceph09:

    2021-07-10T13:11:12.333+0300 7f90a5168700 0 --1- [v2:10.30.3.139:6800/90655,v1:10.30.3.139:6801/90655] >> v1:10.30.3.132:6833/286861 conn(0x561748e62c00 0x561768cc6800 :-1 s=OPENED pgs=2366 cs=245 l=0).fault initiating reconnect

We have since completed the upgrade on the other Ceph nodes without problems, and all nodes are running Octopus 15.2.13 now. But whenever we restart the OSDs on Ceph09 or reboot the node, the same problem occurs immediately. The same operations on the remaining nodes (rebooting the node, restarting OSDs) do not trigger the problem. Interestingly, after the incident we also started to see "fault initiating reconnect" messages for OSDs running on other nodes.

To investigate, we tried reweighting all OSDs on Ceph09 to 0 (a rough sketch of the commands is in the P.S. below); the same problem occurred again, with slow requests and degraded client IO. Restarting all Ceph daemons on the monitors did not fix it, and we rebooted the monitor nodes one by one several times without luck. Finally, rebooting cephmonitor01 twice fixed the issue.

We have checked all the network settings, including MTUs, and everything seems fine. iperf3 tests between any two nodes in the cluster give the expected results, and dmesg and syslog contain no critical messages about the disks. At the moment, any operation on Ceph09 triggers the problem and we have not found a way to fix it.

Does anyone have an idea about this problem, or any advice on how to trace it? Any advice and suggestions would be greatly appreciated.

Best regards,
HC
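
P.S. For reference, the reweight step was roughly the following; the loop assumes the host bucket in the CRUSH map is named Ceph09, and osd.123 below is just a placeholder ID:

    # set the reweight of every OSD hosted on Ceph09 to 0
    for id in $(ceph osd ls-tree Ceph09); do ceph osd reweight "$id" 0; done

If it would help with debugging, we can also capture op dumps from one of the affected OSDs and raise the messenger debug level while reproducing the problem, e.g.:

    ceph health detail                          # list the current slow requests
    ceph daemon osd.123 dump_ops_in_flight      # run on the host carrying osd.123
    ceph daemon osd.123 dump_historic_ops       # recently completed slow ops
    ceph config set osd.123 debug_ms 1          # more verbose messenger logging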
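
P.P.S. The MTU and network checks mentioned above were along these lines; the interface name is illustrative, and the ping payload of 8972 bytes assumes a 9000-byte MTU (use 1472 for a 1500-byte MTU). 10.30.3.132 is one of the peer OSD hosts from the log line above:

    ip link show eno1                           # confirm the configured MTU
    ping -M do -s 8972 10.30.3.132              # largest unfragmented payload to a peer
    iperf3 -c 10.30.3.132 -P 4 -t 30            # throughput between two cluster nodes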
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
