Re: Slow requests triggered by a single node

Hi Nico,

Did you find a permanent fix for this problem? Every time we restart the
OSDs on this specific node or reboot the node itself, the problem occurs
immediately. We cannot plan any maintenance on this node without causing
slow requests. After unsetting the noout flag, we tried powering off the
node to let Ceph auto-heal, but the problem arose again.

Thanks in advance,
HC

On Tue, 13 Jul 2021 at 14:35, Nico Schottelius
<nico.schottelius@xxxxxxxxxxx> wrote:

>
> HC,
>
> we saw a very similar problem some months ago on Nautilus, where our
> cluster had multiple hours of slow client IO. The "solution" was to
> re-re-re-start most components. As several OSDs were often reported as
> slow, restarting the slow OSDs one after another *seemed* to help;
> however, restarting the monitors later also helped to clean up the
> situation.
>
> Overall there is no clear pattern of failure; ours started with the
> outage of a single node, not with an upgrade.
>
> HTH nonetheless,
>
> Nico
>
>
> Cloud Tech <cloudtechtr@xxxxxxxxx> writes:
>
> > Dear Cephers,
> >
> > I have a Ceph cluster with 16 nodes and 335 OSDs, all running Octopus
> > 15.2.13 now. During the upgrade from Nautilus last week, a problem was
> > triggered by a specific node (Ceph09), and several slow requests were
> > seen after upgrading Ceph09. The first 8 nodes have identical hardware
> > and completed the upgrade process without problems. After the problem
> > appeared, client IO nearly stopped, and rebooting the Ceph09 node did
> > not help. The only thing that helped was rebooting the monitor nodes
> > one by one to get rid of these slow requests.
> >
> > We have seen lots of "fault initiating reconnect" messages like the one
> > below in the OSD logs of the Ceph09 node.
> >
> > 2021-07-10T13:11:12.333+0300 7f90a5168700  0 --1- [v2:10.30.3.139:6800/90655,v1:10.30.3.139:6801/90655] >> v1:10.30.3.132:6833/286861 conn(0x561748e62c00 0x561768cc6800 :-1 s=OPENED pgs=2366 cs=245 l=0).fault initiating reconnect
> >
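
To see which peers the reconnect faults point at, the OSD logs can be
tallied by peer address. The following is only a minimal Python sketch, not
a Ceph tool: the regular expression is derived solely from the single sample
line quoted above, it assumes IPv4 addresses, and the name
count_faults_by_peer is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the one sample line in this thread (assumption:
# other Ceph releases or IPv6 addresses may need a different pattern).
# It captures the peer address that follows ">>" on a reconnect fault.
FAULT_RE = re.compile(
    r">>\s+(?:v[12]:)?(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<port>\d+)/\d+"
    r"\s+conn\(.*\)\.fault initiating reconnect"
)


def count_faults_by_peer(lines):
    """Return a Counter mapping peer IP -> number of reconnect faults."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts
```

Feeding it the lines of an OSD log file from Ceph09 (for example via
open(path) on whatever log path the deployment uses) and printing
counts.most_common() would show whether the faults cluster on a few peers or
are spread across the whole cluster.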
> > We have completed the upgrade process for the other Ceph nodes without
> > problems, and all nodes are running Octopus 15.2.13 now. But when we
> > restart the OSDs of Ceph09 or reboot the node, the same problem occurs
> > immediately. No operation on the remaining nodes, including rebooting
> > a node or restarting OSDs, triggers the problem. Interestingly, we
> > started to see the "fault initiating reconnect" messages for other OSDs
> > running on different nodes after the problem.
> >
> > To investigate the problem, we tried reweighting all OSDs on Ceph09 to
> > 0, but the same problem occurred again, with slow requests and
> > performance problems on client IO. Restarting all Ceph daemons on the
> > monitors did not fix it. We rebooted the monitor nodes one by one
> > several times without luck. Finally, rebooting cephmonitor01 twice
> > fixed the issue.
> >
> > We have checked all the network settings, including MTUs, and
> > everything seems fine. iperf3 tests between any two nodes in the
> > cluster give the expected results. The dmesg and syslog messages do
> > not include any critical messages about disks.
> >
> > At the moment, any operation on Ceph09 triggers the problem, and we
> > have not found a way to fix it.
> >
> > Does anyone have any idea about this problem or any advice to trace the
> > problem?
> >
> > Any advice and suggestions would be greatly appreciated.
> >
> > Best regards,
> > HC
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
> --
> Sustainable and modern Infrastructures by ungleich.ch
>