HC,

we have seen a very similar problem some months ago on Nautilus, where
our cluster had multiple hours of slow client IO. The "solution" was to
re-re-re-start most components. As several OSDs were regularly flagged
as slow, restarting the slow OSDs one by one *seemed* to help; however,
later restarting the monitors also helped to clean up the situation.

Overall there was no clear pattern of failure; ours started with the
outage of a single node, not with an upgrade.

HTH nonetheless,

Nico

Cloud Tech <cloudtechtr@xxxxxxxxx> writes:

> Dear Cephers,
>
> I have a Ceph cluster with 16 nodes and 335 OSDs, all running Octopus
> 15.2.13 now. During the upgrade from Nautilus last week, a problem was
> triggered by a specific node (Ceph09) and several slow requests were
> seen after upgrading Ceph09. The first 8 nodes have identical hardware
> and completed the upgrade process without problems. After the problem,
> client IO nearly stopped and rebooting the Ceph09 node did not solve
> it. The only thing that helped was rebooting the monitor nodes one by
> one to get rid of these slow requests.
>
> We have seen lots of "fault initiating reconnect" messages like the
> one below in the OSD logs on the Ceph09 node.
>
> 2021-07-10T13:11:12.333+0300 7f90a5168700 0 --1-
> [v2:10.30.3.139:6800/90655,v1:10.30.3.139:6801/90655] >>
> v1:10.30.3.132:6833/286861 conn(0x561748e62c00 0x561768cc6800 :-1
> s=OPENED pgs=2366 cs=245 l=0).fault initiating reconnect
>
> We have completed the upgrade process for the other Ceph nodes without
> problems and all nodes are running Octopus 15.2.13 now. But when we
> restart the OSDs of Ceph09 or reboot the node, the same problem occurs
> immediately. Any operation on the remaining nodes, including rebooting
> a node or restarting OSDs, does not trigger the problem.
> Interestingly, after the problem we started to see the "fault
> initiating reconnect" messages for other OSDs running on different
> nodes as well.
>
> To investigate the problem, we tried to reweight all OSDs on Ceph09 to
> 0; the same problem occurred again and we had slow requests and
> performance problems on client IO. Restarting all Ceph daemons on the
> monitors did not help. We rebooted the monitor nodes one by one
> several times without luck. Finally, rebooting cephmonitor01 twice
> fixed the issue.
>
> We have checked all the network settings including MTUs and everything
> seems fine. iperf3 tests between any two nodes in the cluster provide
> the expected results. dmesg and syslog do not include any critical
> messages about the disks.
>
> For the time being, any operation on Ceph09 triggers the problem and
> we have not found a way to fix it.
>
> Does anyone have any idea about this problem, or any advice on how to
> trace it?
>
> Any advice and suggestions would be greatly appreciated.
>
> Best regards,
> HC

--
Sustainable and modern Infrastructures by ungleich.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
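
As a side note for tracing which daemons are actually involved: the
cluster's active health checks (SLOW_OPS and friends) can be polled
programmatically instead of grepping individual OSD logs. Below is a
minimal sketch, not a drop-in tool, using the python3-rados bindings.
It assumes python3-rados is installed, that /etc/ceph/ceph.conf and the
client.admin keyring at the default path are readable on the host, and
that the health-check field names match what Octopus returns (they can
differ slightly between releases).

    #!/usr/bin/env python3
    # Sketch: print the cluster's active health checks via the monitors.
    # Assumes default conf/keyring paths; adjust for your deployment.
    import json

    import rados

    cluster = rados.Rados(
        conffile="/etc/ceph/ceph.conf",
        conf=dict(keyring="/etc/ceph/ceph.client.admin.keyring"))
    cluster.connect()
    try:
        # Equivalent to "ceph status --format json".
        ret, outbuf, errs = cluster.mon_command(
            json.dumps({"prefix": "status", "format": "json"}), b"")
        if ret != 0:
            raise RuntimeError(errs)
        status = json.loads(outbuf)
        # On Octopus the active checks (e.g. SLOW_OPS, OSD_SLOW_PING_TIME)
        # live under health/checks; exact field names may vary by release.
        checks = status.get("health", {}).get("checks", {})
        for name, check in checks.items():
            print(name, "-", check.get("summary", {}).get("message", ""))
    finally:
        cluster.shutdown()

Running something like this in a loop while the slow requests are
active should show which OSDs or monitors the SLOW_OPS checks point at,
which is easier to correlate across 16 nodes than following each OSD
log by hand.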