On 01/23/2017 11:04 AM, Kevin Stange wrote:
> I have three different types of CentOS 6 Xen 4.4 based hypervisors (by
> hardware) that are experiencing stability issues which I haven't been
> able to track down. All three types seem to be having issues with NIC
> and/or PCIe. In most cases, the issues are unrecoverable and require a
> hard boot to resolve. All have Intel NICs.
>
> Often the systems will remain stable for days or weeks, then suddenly
> encounter one of these issues. I have yet to tie the error to any
> specific action on the systems and can't reproduce it reliably.
>
> - Supermicro X8DT3, Dual Xeon E5620, 2x 82575EB NICs, 2x 82576 NICs
>
> Kernel messages upon failure:
>
> pcieport 0000:00:03.0: AER: Multiple Corrected error received: id=0018
> pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected,
> type=Transaction Layer, id=0018(Receiver ID)
> pcieport 0000:00:03.0: device [8086:340a] error
> status/mask=00002000/00001001
> pcieport 0000:00:03.0: [13] Advisory Non-Fatal
> pcieport 0000:00:03.0: Error of this Agent(0018) is reported first
> igb 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical
> Layer, id=0400(Receiver ID)
> igb 0000:04:00.0: device [8086:10a7] error status/mask=00002001/00002000
> igb 0000:04:00.0: [ 0] Receiver Error (First)
> igb 0000:04:00.1: PCIe Bus Error: severity=Corrected, type=Physical
> Layer, id=0401(Receiver ID)
> igb 0000:04:00.1: device [8086:10a7] error status/mask=00002001/00002000
> igb 0000:04:00.1: [ 0] Receiver Error (First)
>
> This spams to the console continuously until hard booting.
>
> - Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB
>
> igb 0000:82:00.0: Detected Tx Unit Hang
> Tx Queue <1>
> TDH <43>
> TDT <50>
> next_to_use <50>
> next_to_clean <43>
> buffer_info[next_to_clean]
> time_stamp <12e6bc0b6> next_to_watch <ffff880006aa7440>
> jiffies <12e6bc8dc>
> desc.status <1c8210>
>
> This spams to the console continuously until hard booting.
>
> - Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB
>
> e1000e 0000:04:00.0 eth2: Detected Hardware Unit Hang:
> TDH <ff>
> TDT <33>
> next_to_use <33>
> next_to_clean <fd>
> buffer_info[next_to_clean]:
> time_stamp <138230862>
> next_to_watch <ff>
> jiffies <138231ac0>
> next_to_watch.status <0>
> MAC Status <80383>
> PHY Status <792d>
> PHY 1000BASE-T Status <3c00>
> PHY Extended Status <3000>
> PCI Status <10>
>
> On this type of system, the NIC automatically recovers and I don't
> need to reboot.
>
> So far I tried using pcie_aspm=off to see if that would help, but it
> appears that the 3.18 kernel turns off ASPM by default on these due to
> probing the BIOS. Stability issues were not resolved by the changes.
>
> On the latter system type I also turned off all offloading settings. It
> appears the stability increased slightly but it didn't fully resolve the
> problem. I haven't adjusted offload settings on the first two server
> types yet.
>
> I suspect this problem is related to the 3.18 kernel used by the virt
> SIG, as we had these running Xen on CentOS 5's kernel with no issues for
> years, and systems of these types used elsewhere in our facility are
> stable under CentOS 6's standard kernel. This affects more than one
> server of each type, so I don't believe it is a hardware failure, or
> else it's a hardware design flaw.
>
> Has anyone experienced similar issues with this configuration, and if
> so, does anyone have tips on how to resolve the issues?
>

Kevin,

Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along
with the newer linux-firmware packages and xfsprogs).

If you enable the xen-testing repository in your CentOS-Xen.repo file
(assuming it is pointing to xen-44 and not xen-46) then a 'yum upgrade'
should replace all the needed packages.

The actual path is here for the packages:

https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/

Hopefully this helps.

Thanks,
Johnny Hughes
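
To judge whether a newer kernel actually changes anything, it may help to
count the error messages per device before and after the upgrade. The
following is only a rough sketch in Python, not something from the thread:
the function name tally_nic_errors, the KEYWORDS tuple and the regular
expression are all made up for illustration, and the keyword strings are
simply the ones that appear in the kernel messages quoted above. It uses a
plain dict so that it should also run on an older Python.

```python
import re

# Matches "<driver> <domain:bus:device.function>" anywhere in a line,
# e.g. "igb 0000:04:00.0" or "e1000e 0000:04:00.0".
# Illustrative pattern only; adjust it to the log format actually in use.
DEVICE_RE = re.compile(
    r"(\w+) ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\b"
)

# Message fragments taken from the kernel output quoted in this thread.
KEYWORDS = (
    "PCIe Bus Error",
    "Detected Tx Unit Hang",
    "Detected Hardware Unit Hang",
)


def tally_nic_errors(lines):
    """Count matching messages per (driver, PCI address, keyword)."""
    counts = {}
    for line in lines:
        match = DEVICE_RE.search(line)
        if match is None:
            continue  # line does not name a driver and PCI address
        for keyword in KEYWORDS:
            if keyword in line:
                key = (match.group(1), match.group(2), keyword)
                counts[key] = counts.get(key, 0) + 1
    return counts
```

One possible use, assuming the kernel messages have been saved to a text
file, is to pass the open file object to tally_nic_errors() and compare the
resulting counts for the 3.18 and 4.9 kernels over a similar uptime.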
_______________________________________________
CentOS-virt mailing list
CentOS-virt@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos-virt