On 02/21/2017 11:47 AM, Johnny Hughes wrote:
> On 01/23/2017 11:04 AM, Kevin Stange wrote:
>> I have three different types of CentOS 6 Xen 4.4 based hypervisors (by
>> hardware) that are experiencing stability issues which I haven't been
>> able to track down. All three types seem to be having issues with NIC
>> and/or PCIe. In most cases, the issues are unrecoverable and require a
>> hard boot to resolve. All have Intel NICs.
>>
>> Often the systems will remain stable for days or weeks, then suddenly
>> encounter one of these issues. I have yet to tie the error to any
>> specific action on the systems and can't reproduce it reliably.
>>
>> - Supermicro X8DT3, Dual Xeon E5620, 2x 82575EB NICs, 2x 82576 NICs
>>
>> Kernel messages upon failure:
>>
>> pcieport 0000:00:03.0: AER: Multiple Corrected error received: id=0018
>> pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected,
>> type=Transaction Layer, id=0018(Receiver ID)
>> pcieport 0000:00:03.0: device [8086:340a] error
>> status/mask=00002000/00001001
>> pcieport 0000:00:03.0: [13] Advisory Non-Fatal
>> pcieport 0000:00:03.0: Error of this Agent(0018) is reported first
>> igb 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical
>> Layer, id=0400(Receiver ID)
>> igb 0000:04:00.0: device [8086:10a7] error status/mask=00002001/00002000
>> igb 0000:04:00.0: [ 0] Receiver Error (First)
>> igb 0000:04:00.1: PCIe Bus Error: severity=Corrected, type=Physical
>> Layer, id=0401(Receiver ID)
>> igb 0000:04:00.1: device [8086:10a7] error status/mask=00002001/00002000
>> igb 0000:04:00.1: [ 0] Receiver Error (First)
>>
>> This spams to the console continuously until hard booting.
>>
>> - Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB
>>
>> igb 0000:82:00.0: Detected Tx Unit Hang
>> Tx Queue <1>
>> TDH <43>
>> TDT <50>
>> next_to_use <50>
>> next_to_clean <43>
>> buffer_info[next_to_clean]
>> time_stamp <12e6bc0b6> next_to_watch <ffff880006aa7440>
>> jiffies <12e6bc8dc>
>> desc.status <1c8210>
>>
>> This spams to the console continuously until hard booting.
>>
>> - Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB
>>
>> e1000e 0000:04:00.0 eth2: Detected Hardware Unit Hang:
>> TDH <ff>
>> TDT <33>
>> next_to_use <33>
>> next_to_clean <fd>
>> buffer_info[next_to_clean]:
>> time_stamp <138230862>
>> next_to_watch <ff>
>> jiffies <138231ac0>
>> next_to_watch.status <0>
>> MAC Status <80383>
>> PHY Status <792d>
>> PHY 1000BASE-T Status <3c00>
>> PHY Extended Status <3000>
>> PCI Status <10>
>>
>> On this type of system, the NIC automatically recovers and I don't
>> need to reboot.
>>
>> So far I tried using pcie_aspm=off to see if that would help, but it
>> appears that the 3.18 kernel turns off ASPM by default on these due to
>> probing the BIOS. Stability issues were not resolved by the changes.
>>
>> On the latter system type I also turned off all offloading settings. It
>> appears the stability increased slightly but it didn't fully resolve the
>> problem. I haven't adjusted offload settings on the first two server
>> types yet.
>>
>> I suspect this problem is related to the 3.18 kernel used by the virt
>> SIG, as we had these running Xen on CentOS 5's kernel with no issues for
>> years, and systems of these types used elsewhere in our facility are
>> stable under CentOS 6's standard kernel. This affects more than one
>> server of each type, so I don't believe it is a hardware failure, or
>> else it's a hardware design flaw.
>>
>> Has anyone experienced similar issues with this configuration, and if
>> so, does anyone have tips on how to resolve the issues?
>>
>
>
> Kevin,
>
> Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along
> with the newer linux-firmware packages and xfsprogs).
>
> If you enable the xen-testing repository in your CentOS-Xen.repo file
> (assuming it is pointing to xen-44 and not xen-46) then a 'yum upgrade'
> should replace all the needed packages.
>
> The actual path is here for the packages:
>
> https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
>
> Hopefully this helps.
>

I should have said .. 'just released for testing' :)

I have been using this for 4 or 5 days with no issues in production, but
it needs testing before final release :)
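For anyone trying to read the status/mask pairs in the AER lines quoted
above, here is a minimal sketch of how they can be decoded. It assumes the
values are the PCIe Correctable Error Status and Mask registers, that the
bit names match the strings older kernels print, and that the kernel only
reports bits that are set in status and clear in mask. The table and the
decode_aer helper are illustrative, not taken from the kernel source, so
check them against the PCIe spec before relying on them.

```python
# Bit positions of the PCIe AER Correctable Error Status register.
# ASSUMPTION: names follow the strings older Linux kernels print; verify
# against the PCIe specification for your hardware.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal",
}


def decode_aer(status_mask):
    """Split a 'status/mask' hex pair into reported and suppressed errors.

    decode_aer is a hypothetical helper name used only for this sketch.
    """
    status_hex, mask_hex = status_mask.split("/")
    status, mask = int(status_hex, 16), int(mask_hex, 16)

    def names(value):
        # Translate every set bit into a name, falling back to "bit N".
        return [CORRECTABLE_BITS.get(bit, "bit %d" % bit)
                for bit in range(32) if value & (1 << bit)]

    return {
        "reported": names(status & ~mask),   # set and not masked
        "suppressed": names(status & mask),  # set but masked
    }


if __name__ == "__main__":
    # Values copied from the log lines quoted above.
    for device, pair in [("pcieport 0000:00:03.0", "00002000/00001001"),
                         ("igb 0000:04:00.0", "00002001/00002000")]:
        print(device, decode_aer(pair))
```

Under those assumptions the decoded bits line up with what the log already
prints: bit 13 (Advisory Non-Fatal) on the root port, and bit 0 (Receiver
Error) on the igb functions, where bit 13 is also set but masked.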
_______________________________________________
CentOS-virt mailing list
CentOS-virt@xxxxxxxxxx
https://lists.centos.org/mailman/listinfo/centos-virt