Re: Ceph OSD daemon causes network card issues

I've been having an issue since upgrading my cluster to Mimic 6 months ago
(previously installed with Luminous 12.2.1).
All nodes that have the same PCIe network card seem to lose network
connectivity at random (anywhere from every few days to every few weeks
per host node).
The affected nodes have only the Intel 82576 LAN card in common; they have
different motherboards, installed drives, RAM and even PSUs.
Nodes that have Intel I350 cards were not affected by the Mimic upgrade.
Each host node has the recommended amount of RAM and between 4 and 6 OSDs
(SATA hard drives) installed.
The cluster operated for over a year on Luminous without a single issue;
the problems on these nodes began only after the Mimic upgrade.
The cluster is only used for CephFS (file storage, low-intensity usage) and
uses an erasure-coded data pool (K=4, M=2).

I've tested many things: different kernel versions, different Ubuntu LTS
releases, re-installation and even CentOS 7, different releases of Mimic,
and different igb drivers.
If I stop the ceph-osd daemons the issue does not occur. If I swap out the
Intel 82576 card for the Intel I350 the issue is resolved.
I have no more ideas other than replacing the cards, but I feel the
issue is linked to the ceph-osd daemon and a change in the Mimic release.
Below are the various software versions and drivers I've tried and a log
extract from a node that lost network connectivity. Any help or
suggestions would be greatly appreciated.

*OS:*               Ubuntu 16.04 / 18.04 and recently CentOS 7
*Ceph Version:*     Mimic (currently 13.2.6)
*Network card:*     4-PORT 1GB INTEL 82576 LAN CARD (AOC-SG-I4)
*Driver:*           igb
*Driver Versions:*  5.3.0-k / 5.3.5.22s / 5.4.0-k
*Network Config:*   2 x bonded (LACP) 1 Gb NICs for public net,
                    2 x bonded (LACP) 1 Gb NICs for private net
*Log errors:*
Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb 0000:03:00.0 enp3s0f0: PCIe link lost, device now detached
Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb 0000:04:00.1 enp4s0f1: PCIe link lost, device now detached
Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb 0000:03:00.1 enp3s0f1: PCIe link lost, device now detached
Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb 0000:04:00.0 enp4s0f0: PCIe link lost, device now detached
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.4.1:6809 osd.16 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.6.1:6804 osd.20 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.7.1:6803 osd.25 since back 2019-06-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.8.1:6803 osd.30 since back 2019-06-27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from 10.100.9.1:6808 osd.43 since back 2019-06-27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27 12:10:23.796726)
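
For triage, the detach events can be pulled out of a syslog extract like the one above with a short script. This is only a sketch: it assumes each log entry sits on a single line (the mail client wrapped them here), and the function name `find_link_lost` and the regular expression are illustrative, not part of any Ceph or igb tooling.

```python
import re

# Matches kernel lines of the form seen in the extract, e.g.
#   ... kernel: [497346.638608] igb 0000:03:00.0 enp3s0f0: PCIe link lost, ...
# The pattern is an assumption based on those four lines only.
LINK_LOST = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] igb (?P<pci>[0-9a-f:.]+) "
    r"(?P<iface>\S+): PCIe link lost"
)


def find_link_lost(lines):
    """Return sorted (uptime_seconds, pci_address, interface) detach events."""
    events = []
    for line in lines:
        match = LINK_LOST.search(line)
        if match:
            events.append(
                (float(match["uptime"]), match["pci"], match["iface"])
            )
    return sorted(events)


# The four kernel lines from the extract, joined back onto single lines.
sample = [
    "Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb 0000:03:00.0 "
    "enp3s0f0: PCIe link lost, device now detached",
    "Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb 0000:04:00.1 "
    "enp4s0f1: PCIe link lost, device now detached",
    "Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb 0000:03:00.1 "
    "enp3s0f1: PCIe link lost, device now detached",
    "Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb 0000:04:00.0 "
    "enp4s0f0: PCIe link lost, device now detached",
]

events = find_link_lost(sample)
# Time between the first and last port detaching, in seconds of uptime.
spread = events[-1][0] - events[0][0]
print(f"{len(events)} ports detached within {spread:.3f} s")
```

On this extract, all four ports on both PCI devices detach within roughly one second, about 15 seconds before osd.15 logs its heartbeat failures.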

Paste your `ethtool -S <interface>`, `ethtool -i <interface>` and `dmesg -TL | grep igb`.
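
A small wrapper can gather that output for every port in one go. This is a minimal sketch, not a tested tool: the helper names (`diagnostic_commands`, `filter_igb`, `collect`) are made up for illustration, the interface names come from the log extract in the original message, and `-L` is dropped from the `dmesg` call because colour has no effect on captured output.

```python
import subprocess


def diagnostic_commands(iface):
    """Return the two ethtool invocations requested for one interface."""
    return [["ethtool", "-S", iface], ["ethtool", "-i", iface]]


def filter_igb(dmesg_text):
    """Keep only kernel log lines that mention igb (the `grep igb` step)."""
    return [line for line in dmesg_text.splitlines() if "igb" in line]


def collect(ifaces, run=subprocess.run):
    """Run every command and return the combined report as one string.

    `run` is injectable so the function can be exercised without ethtool.
    """
    report = []
    for iface in ifaces:
        for cmd in diagnostic_commands(iface):
            result = run(cmd, capture_output=True, text=True, check=False)
            report.append("$ " + " ".join(cmd))
            # Fall back to stderr so a failing command is still visible.
            report.append(result.stdout or result.stderr)
    kernel_log = run(
        ["dmesg", "-T"], capture_output=True, text=True, check=False
    )
    report.append("$ dmesg -T | grep igb")
    report.extend(filter_igb(kernel_log.stdout))
    return "\n".join(report)
```

Running `print(collect(["enp3s0f0", "enp3s0f1", "enp4s0f0", "enp4s0f1"]))` as root on an affected node, ideally both before and after a failure, should produce text that can be pasted into a reply.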



k

