HC,

we have seen a very similar problem some months ago on Nautilus, where
our cluster had multiple hours of slow client IO. The "solution" was to
re-re-re-start most components. As several OSDs were regularly flagged
as slow, restarting the slow OSDs one by one *seemed* to help; however,
later restarting the monitors also helped to clean up the situation.

Overall there was no clear pattern of failure; ours started with the
outage of a single node, not with an upgrade.

HTH nonetheless,

Nico

Cloud Tech <cloudtechtr@xxxxxxxxx> writes:

> Dear Cephers,
>
> I have a Ceph cluster with 16 nodes and 335 OSDs, all running Octopus
> 15.2.13 now. During the upgrade from Nautilus last week, a problem was
> triggered by a specific node (Ceph09) and several slow requests were
> seen after upgrading Ceph09. The first 8 nodes have identical hardware
> and completed the upgrade process without problems. After the problem,
> client IO nearly stopped and rebooting the Ceph09 node did not solve
> it. The only thing that helped was rebooting the monitor nodes one by
> one to get rid of these slow requests.
>
> We have seen lots of "fault initiating reconnect" messages like the
> one below in the OSD logs on the Ceph09 node.
>
> 2021-07-10T13:11:12.333+0300 7f90a5168700 0 --1-
> [v2:10.30.3.139:6800/90655,v1:10.30.3.139:6801/90655] >>
> v1:10.30.3.132:6833/286861 conn(0x561748e62c00 0x561768cc6800 :-1
> s=OPENED pgs=2366 cs=245 l=0).fault initiating reconnect
>
> We have completed the upgrade process for the other Ceph nodes without
> problems and all nodes are running Octopus 15.2.13 now. But when we
> restart the OSDs of Ceph09 or reboot the node, the same problem occurs
> immediately. Any operation on the remaining nodes, including rebooting
> a node or restarting OSDs, does not trigger the problem.
> Interestingly, after the problem we started to see the "fault
> initiating reconnect" messages for other OSDs running on different
> nodes as well.
>
> To investigate the problem, we tried to reweight all OSDs on Ceph09 to
> 0; the same problem occurred again and we had slow requests and
> performance problems on client IO. Restarting all Ceph daemons on the
> monitors did not help. We rebooted the monitor nodes one by one
> several times without luck. Finally, rebooting cephmonitor01 twice
> fixed the issue.
>
> We have checked all the network settings including MTUs and everything
> seems fine. iperf3 tests between any two nodes in the cluster provide
> the expected results. dmesg and syslog do not include any critical
> messages about the disks.
>
> For the time being, any operation on Ceph09 triggers the problem and
> we have not found a way to fix it.
>
> Does anyone have any idea about this problem, or any advice on how to
> trace it?
>
> Any advice and suggestions would be greatly appreciated.
>
> Best regards,
> HC

--
Sustainable and modern Infrastructures by ungleich.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
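
As a side note for tracing which daemons are actually involved: the
cluster's active health checks (SLOW_OPS and friends) can be polled
programmatically instead of grepping individual OSD logs. Below is a
minimal sketch, not a drop-in tool, using the python3-rados bindings.
It assumes python3-rados is installed, that /etc/ceph/ceph.conf and the
client.admin keyring at the default path are readable on the host, and
that the health-check field names match what Octopus returns (they can
differ slightly between releases).

    #!/usr/bin/env python3
    # Sketch: print the cluster's active health checks via the monitors.
    # Assumes default conf/keyring paths; adjust for your deployment.
    import json

    import rados

    cluster = rados.Rados(
        conffile="/etc/ceph/ceph.conf",
        conf=dict(keyring="/etc/ceph/ceph.client.admin.keyring"))
    cluster.connect()
    try:
        # Equivalent to "ceph status --format json".
        ret, outbuf, errs = cluster.mon_command(
            json.dumps({"prefix": "status", "format": "json"}), b"")
        if ret != 0:
            raise RuntimeError(errs)
        status = json.loads(outbuf)
        # On Octopus the active checks (e.g. SLOW_OPS, OSD_SLOW_PING_TIME)
        # live under health/checks; exact field names may vary by release.
        checks = status.get("health", {}).get("checks", {})
        for name, check in checks.items():
            print(name, "-", check.get("summary", {}).get("message", ""))
    finally:
        cluster.shutdown()

Running something like this in a loop while the slow requests are
active should show which OSDs or monitors the SLOW_OPS checks point at,
which is easier to correlate across 16 nodes than following each OSD
log by hand.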