I've been having an issue since upgrading my cluster to Mimic 6 months ago (it was previously installed with Luminous 12.2.1). All nodes that have the same PCIe network card seem to lose network connectivity randomly; the frequency ranges from every few days to every few weeks per host node. The affected nodes have only the Intel 82576 LAN card in common: they have different motherboards, installed drives, RAM and even PSUs. Nodes that have Intel I350 cards are not affected by the Mimic upgrade.

Each host node has the recommended amount of RAM installed and has between 4 and 6 OSDs (SATA hard drives). The cluster operated for over a year on Luminous without a single issue; the problems with these nodes began only after the Mimic upgrade. The cluster is used only for CephFS (file storage, low-intensity usage) and uses an erasure-coded data pool (K=4, M=2).

I've tested many things: different kernel versions, different Ubuntu LTS releases, re-installation and even CentOS 7, different releases of Mimic, and different igb drivers. If I stop the ceph-osd daemons, the issue does not occur. If I swap the Intel 82576 card for an Intel I350, the issue is resolved. I have no more ideas other than replacing the cards, but I feel the issue is linked to the ceph-osd daemon and a change in the Mimic release.

Below are the various software versions and drivers I've tried and a log extract from a node that lost network connectivity. Any help or suggestions would be greatly appreciated.
*OS:* Ubuntu 16.04 / 18.04 and recently CentOS 7
*Ceph Version:* Mimic (currently 13.2.6)
*Network card:* 4-PORT 1GB INTEL 82576 LAN CARD (AOC-SG-I4)
*Driver:* igb
*Driver Versions:* 5.3.0-k / 5.3.5.22s / 5.4.0-k
*Network Config:* 2 x bonded (LACP) 1GB nic for public net, 2 x bonded (LACP) 1GB nic for private net

*Log errors:*
Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb 0000:03:00.0 enp3s0f0: PCIe link lost, device now detached
Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb 0000:04:00.1 enp4s0f1: PCIe link lost, device now detached
Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb 0000:03:00.1 enp3s0f1: PCIe link lost, device now detached
Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb 0000:04:00.0 enp4s0f0: PCIe link lost, device now detached
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.4.1:6809 osd.16 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.6.1:6804 osd.20 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.7.1:6803 osd.25 since back 2019-06-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.8.1:6803 osd.30 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.9.1:6808 osd.43 since back 2019-06-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)

Paste your `ethtool -S <interface>`, `ethtool -i <interface>` and `dmesg -TL | grep igb`.
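
Since four ports detached within about a second, it may help to collect the requested output for each of them. The following is only a rough sketch, not a tested tool: it pulls the interface names out of "PCIe link lost" lines like those in the log extract and prints the commands from the request above for each one. The function names `detached_interfaces` and `diagnostic_commands` and the regular expression are illustrative assumptions, and nothing is executed on the host.

```python
import re

# Matches kernel lines such as:
#   igb 0000:03:00.0 enp3s0f0: PCIe link lost, device now detached
# The pattern is an assumption based on the log extract in this thread.
LINK_LOST = re.compile(
    r"igb (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<iface>\S+): PCIe link lost"
)


def detached_interfaces(log_text):
    """Return {interface: PCI address} for every igb port reported as detached."""
    return {m.group("iface"): m.group("pci") for m in LINK_LOST.finditer(log_text)}


def diagnostic_commands(interfaces):
    """Build the commands requested above: two ethtool calls per interface, then dmesg."""
    commands = []
    for iface in sorted(interfaces):
        commands.append(f"ethtool -S {iface}")  # NIC statistics counters
        commands.append(f"ethtool -i {iface}")  # driver and firmware information
    commands.append("dmesg -TL | grep igb")
    return commands


# Two lines copied from the log extract in the original message.
SAMPLE = """\
Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb 0000:03:00.0 enp3s0f0: PCIe link lost, device now detached
Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb 0000:04:00.1 enp4s0f1: PCIe link lost, device now detached
"""

if __name__ == "__main__":
    for cmd in diagnostic_commands(detached_interfaces(SAMPLE)):
        print(cmd)
```

In practice the sample text would be replaced with the node's own kernel log, and the printed commands run by hand once the interfaces are back (for example after a reboot), since a detached device may not answer `ethtool` queries.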
k
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com