Hi,
The Intel 82576 is... bad. I've seen quite a few problems with these older igb-family NICs, but losing the PCIe link is a new one.
I usually see them getting stuck with a message like "tx queue X hung, resetting device..."
Try disabling the offloading features with ethtool; that sometimes helps with the problems I've seen. Maybe this is just a variant of the stuck-queue problem?
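
A rough sketch of the ethtool suggestion, wrapped in Python so it can be applied to all four ports at once. The interface names are taken from the log quoted below; the list of offload features to switch off (tso, gso, gro, sg, rx, tx) is an assumption, since no specific features were named, and the helper names are made up for illustration. By default it only prints the commands; pass dry_run=False (as root) to actually run them.

```python
import subprocess

# Assumption: these are the offloads worth disabling first; adjust as needed.
OFFLOADS = ("tso", "gso", "gro", "sg", "rx", "tx")

# Interface names as they appear in the quoted kernel log.
IFACES = ("enp3s0f0", "enp3s0f1", "enp4s0f0", "enp4s0f1")


def offload_off_command(iface, features=OFFLOADS):
    """Build the 'ethtool -K <iface> <feature> off ...' argument list."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, "off"]
    return cmd


def disable_offloads(ifaces=IFACES, dry_run=True):
    """Print (dry run) or execute the ethtool command for each interface."""
    for iface in ifaces:
        cmd = offload_off_command(iface)
        if dry_run:
            print(" ".join(cmd))
        else:
            # Requires root and the ethtool binary to be installed.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    disable_offloads()
```

The settings do not survive a reboot or driver reload, so if this helps they would need to go into the distribution's network configuration as well. The current state can be checked with "ethtool -k <iface>".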
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Thu, Jul 18, 2019 at 12:47 PM Geoffrey Rhodes <geoffrey@xxxxxxxxxxxxx> wrote:
Hi Cephers,

I've been having an issue since upgrading my cluster to Mimic 6 months ago (previously installed with Luminous 12.2.1).
All nodes that have the same PCIe network card seem to lose network connectivity randomly (frequency ranges from a few days to weeks per host node).
The affected nodes only have the Intel 82576 LAN card in common; they have different motherboards, installed drives, RAM and even PSUs.
Nodes that have the Intel I350 cards are not affected by the Mimic upgrade.
Each host node has the recommended RAM installed and has between 4 and 6 OSDs / SATA hard drives installed.
The cluster operated for over a year (Luminous) without a single issue; only after the Mimic upgrade did the issues begin with these nodes.
The cluster is only used for CephFS (file storage, low-intensity usage) and makes use of an erasure-coded data pool (K=4, M=2).

I've tested many things: different kernel versions, different Ubuntu LTS releases, re-installation and even CentOS 7, different releases of Mimic, different igb drivers.
If I stop the ceph-osd daemons, the issue does not occur. If I swap out the Intel 82576 card for the Intel I350, the issue is resolved.
I haven't any more ideas other than replacing the cards, but I feel the issue is linked to the ceph-osd daemon and a change in the Mimic release.
Below are the various software versions and drivers I've tried and a log extract from a node that lost network connectivity. Any help or suggestions would be greatly appreciated.

OS: Ubuntu 16.04 / 18.04 and recently CentOS 7
Ceph Version: Mimic (currently 13.2.6)
Network card: 4-PORT 1GB INTEL 82576 LAN CARD (AOC-SG-I4)
Driver: igb
Driver Versions: 5.3.0-k / 5.3.5.22s / 5.4.0-k
Network Config: 2 x bonded (LACP) 1GB NICs for public net, 2 x bonded (LACP) 1GB NICs for private net

Log errors:
Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb 0000:03:00.0 enp3s0f0: PCIe link lost, device now detached
Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb 0000:04:00.1 enp4s0f1: PCIe link lost, device now detached
Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb 0000:03:00.1 enp3s0f1: PCIe link lost, device now detached
Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb 0000:04:00.0 enp4s0f0: PCIe link lost, device now detached
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.4.1:6809 osd.16 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.6.1:6804 osd.20 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.7.1:6803 osd.25 since back 2019-06-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.8.1:6803 osd.30 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.9.1:6808 osd.43 since back 2019-06-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)

Kind regards
Geoffrey Rhodes
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com