On Tue, Jul 02, 2019 at 04:25:59PM +0800, Kai Heng Feng wrote:
> +linux-pci
>
> Hi Sasha,
>
> at 6:49 PM, Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx> wrote:
>
> > at 14:26, Neftin, Sasha <sasha.neftin@xxxxxxxxx> wrote:
> >
> > > On 6/26/2019 09:14, Kai Heng Feng wrote:
> > > > Hi Sasha
> > > > at 5:09 PM, Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx> wrote:
> > > > > Hi Jeffrey,
> > > > >
> > > > > We’ve encountered another issue, which causes multiple CRC
> > > > > errors and renders ethernet completely useless, here’s the
> > > > > network stats:
> > > > I also tried ignore_ltr for this issue, seems like it alleviates
> > > > the symptom a bit for a while, then the network still becomes
> > > > useless after some usage.
> > > > And yes, it’s also a Whiskey Lake platform. What’s the next step
> > > > to debug this problem?
> > > > Kai-Heng
> > > CRC errors not related to the LTR. Please, try to disable the ME on
> > > your platform. Hope you have this option in BIOS. Another way is to
> > > contact your PC vendor and ask to provide NVM without ME. Let's
> > > start debugging with these steps.
> >
> > According to ODM, the ME can be physically disabled by a jumper.
> > But after disabling the ME the same issue can still be observed.
>
> We’ve found that this issue doesn’t happen to SATA SSD, it only happens when
> NVMe SSD is in use.
>
> Here are the steps:
> - Disable NVMe ASPM, issue persists
> - modprobe -r e1000e && modprobe e1000e, issue doesn’t happen
> - Enabling NVMe ASPM, issue doesn’t happen
>
> As long as NVMe ASPM gets enabled after e1000e gets loaded, the issue
> doesn’t happen.

IIUC the problem happens with the mainline and dev-queue e1000e
driver, but not with the out-of-tree Intel driver.  Since there is a
working driver and there's the potential (at least in principle) for
unifying them or bisecting between them, I have limited interest in
debugging it from scratch.

If it turns out to be a PCI core problem, I would want to know:
What's the PCI topology?  "lspci -vv" output for the system?  Does it
make a difference if you boot with "pcie_aspm=off"?  Collect complete
dmesg, maybe attach it to a kernel.org bugzilla?

> > > > > /sys/class/net/eno1/statistics$ grep . *
> > > > > collisions:0
> > > > > multicast:95
> > > > > rx_bytes:1499851
> > > > > rx_compressed:0
> > > > > rx_crc_errors:1165
> > > > > rx_dropped:0
> > > > > rx_errors:2330
> > > > > rx_fifo_errors:0
> > > > > rx_frame_errors:0
> > > > > rx_length_errors:0
> > > > > rx_missed_errors:0
> > > > > rx_nohandler:0
> > > > > rx_over_errors:0
> > > > > rx_packets:4789
> > > > > tx_aborted_errors:0
> > > > > tx_bytes:864312
> > > > > tx_carrier_errors:0
> > > > > tx_compressed:0
> > > > > tx_dropped:0
> > > > > tx_errors:0
> > > > > tx_fifo_errors:0
> > > > > tx_heartbeat_errors:0
> > > > > tx_packets:7370
> > > > > tx_window_errors:0
> > > > >
> > > > > Same behavior can be observed on both mainline kernel and on
> > > > > your dev-queue branch.
> > > > > OTOH, the same issue can’t be observed on out-of-tree e1000e.
> > > > >
> > > > > Is there any plan to close the gap between upstream and
> > > > > out-of-tree version?
> > > > >
> > > > > Kai-Heng
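
For comparing runs (e.g. before and after reloading e1000e), the
counters quoted in this thread can be reduced to one number. The
following is only a minimal sketch: parse_stats() and crc_ratio() are
hypothetical helper names, and dividing rx_crc_errors by rx_packets is
an assumed convenience metric, not something the driver reports.

```python
# A few of the counters from the quoted "grep . *" output, quote markers included.
SAMPLE = """\
> > > > > /sys/class/net/eno1/statistics$ grep . *
> > > > > rx_crc_errors:1165
> > > > > rx_errors:2330
> > > > > rx_packets:4789
> > > > > tx_errors:0
> > > > > tx_packets:7370
"""


def parse_stats(text):
    """Parse 'name:value' lines (as printed by `grep . *`) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        # Drop leading '>' quote markers and spaces left over from the mail thread.
        line = line.lstrip("> ").strip()
        name, sep, value = line.partition(":")
        # Skip anything that is not a plain 'counter:integer' line (e.g. the prompt).
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value)
    return stats


def crc_ratio(stats):
    """rx_crc_errors divided by rx_packets (assumed metric); 0.0 if no packets."""
    packets = stats.get("rx_packets", 0)
    if packets == 0:
        return 0.0
    return stats.get("rx_crc_errors", 0) / packets


if __name__ == "__main__":
    stats = parse_stats(SAMPLE)
    print(f"rx_crc_errors / rx_packets = {crc_ratio(stats):.3f}")
```

On a live system the same text can be obtained by running "grep . *" in
/sys/class/net/<iface>/statistics and passing the output to
parse_stats().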