On Fri, Aug 05, 2016 at 12:15:53PM -0600, Daniel Drake wrote:
> Hi Alexander,
>
> Reviving an old topic here...
>
> We are seeing this "problem" on an increasing number of units from the
> vendor, and searching around it can also be seen on Dell and HP
> products. Always with the same Realtek b723 wifi device. e.g.
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173
>
> The amount of error spam is problematic in that it slows down boot
> really significantly, while printing lots of scary messages for the
> user.
> We tried doing a PCI MSI blacklist for affected laptops but we are
> struggling to keep that blacklist updated with the increasing number
> of affected models.
>
> Enough hacks, I am wondering what we can do to solve this problem in
> the mainline kernel...

I think this is a bug in AER:
https://bugzilla.kernel.org/show_bug.cgi?id=109691

I think I understand the problem, but I haven't had time to fix it.
The bugzilla has a pointer to more details, and it would be awesome if
somebody would jump in.

> On Thu, Sep 3, 2015 at 12:05 PM, Alexander Duyck
> <alexander.duyck@xxxxxxxxx> wrote:
> > On 09/03/2015 06:32 AM, Daniel Drake wrote:
> >>
> >> On Wed, Sep 2, 2015 at 7:57 PM, Alexander Duyck
> >> <alexander.duyck@xxxxxxxxx> wrote:
> >>>
> >>> Since it is correctable errors it is likely some sort of signalling
> >>> issue.
> >>> Could we get the output of something like an lspci -vt? Then you would be
> >>> able to tell what the device is on the other side of the link from
> >>> 00:1c.5
> >>> and then we could probably check to see if there has been any changes for
> >>> the device driver on the other end of the link.
> >>
> >> "lspci -vt" reliably causes one occurrence of the message, which is
> >> logged by the kernel before lspci has written anything to stdout.
> >> pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
> >> pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
> >> pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000001/00002000
> >> pcieport 0000:00:1c.5:    [ 0] Receiver Error
> >>
> >> -[0000:00]-+-00.0  Intel Corporation Device 1904
> >>            +-02.0  Intel Corporation Device 1916
> >>            +-04.0  Intel Corporation Device 1903
> >>            +-08.0  Intel Corporation Device 1911
> >>            +-14.0  Intel Corporation Device 9d2f
> >>            +-14.2  Intel Corporation Device 9d31
> >>            +-15.0  Intel Corporation Device 9d60
> >>            +-15.1  Intel Corporation Device 9d61
> >>            +-16.0  Intel Corporation Device 9d3a
> >>            +-17.0  Intel Corporation Device 9d03
> >>            +-1c.0-[01]--
> >>            +-1c.4-[02]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller
> >>            +-1c.5-[03]----00.0  Realtek Semiconductor Co., Ltd. Device b723
> >>            +-1f.0  Intel Corporation Device 9d48
> >>            +-1f.2  Intel Corporation Device 9d21
> >>            +-1f.3  Intel Corporation Device 9d70
> >>            \-1f.4  Intel Corporation Device 9d23
> >>
> >> Does this mean these messages are somehow related to the Realtek b723
> >> device? That is the wifi card.
> >> Using x86_64_defconfig there is not even any driver loaded for this
> >> device, yet the messages appear quite a bit.
> >> If I use a full config with all the relevant drivers including
> >> rtlwifi, the frequency of these messages goes up a lot though.
> >
> >
> > The correctable errors are likely a result of some sort of link error
> > between the root port 00:1c.5 and the wireless adapter at 3:00.0. What is
> > likely happening is that when the device is unused it transitions down to a
> > lower power link state like L0s or L1, and when it comes out of that state
> > it is likely triggering the PCIe error most likely as a result of something
> > during the PCIe link training sequence.
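
As an aside on the log quoted above, the status/mask pair can be decoded
bit by bit. What follows is only a rough Python sketch: the bit positions
are those of the AER Correctable Error Status and Mask registers (only a
subset is listed), and the names decode() and parse_status_mask() are
made up for illustration, not taken from any kernel or pciutils code.

```python
# Subset of the AER Correctable Error Status/Mask bits (assumed complete
# enough for the log in this thread; not an exhaustive table).
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode(value):
    """Return the names of the known bits set in a status or mask value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if (value >> bit) & 1]


def parse_status_mask(text):
    """Split a 'status/mask=XXXXXXXX/YYYYYYYY' fragment into two integers."""
    _, _, pair = text.partition("status/mask=")
    status, mask = pair.split("/")
    return int(status, 16), int(mask, 16)


# Hypothetical usage with the fragment from the log in this thread.
status, mask = parse_status_mask("status/mask=00000001/00002000")
print("reported:", decode(status))
print("masked:  ", decode(mask))
```

Read this way, the only bit being reported is Receiver Error, and the
only bit masked is Advisory Non-Fatal Error, which is consistent with the
"[ 0] Receiver Error" line in the log.
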
> >
> > You might want to notify the manufacturer of the laptop as they may need to
> > address an issue in their hardware, firmware, or possibly add a workaround
> > to mask off Receiver Error reporting for their part via either a PCIe quirk
> > or driver fix.
> >
> >>> My suspicion since this is a laptop is that something like a power
> >>> management change might be responsible if this is a regression as I have
> >>> seen messages like this pop up as a result of ASPM being enabled before.
> >>
> >> It's likely not a regression, this is brand new hardware and this
> >> message is seen on all kernels that we have tried (4.1, 4.2, master).
> >> pcie_aspm=off also makes these messages go away.
> >
> >
> > Correctable errors are considered a sign of the PCIe link health. In theory
> > they can be ignored since by definition they can be corrected by the
> > hardware. One thing you could do if you aren't using the wireless card
> > would be to simply switch off the correctable error reporting by setting the
> > mask bit for it in configuration space using setpci.
> >
> > To do that what you could do is find the offset for the PCIe AER
> > configuration register for your port by doing a "lspci -vvv -s 0:1c.5" and
> > what you should get will be a dump listing the capabilities and their
> > current settings. In there you should find a line like:
> >     Capabilities: [148 v1] Advanced Error Reporting
> >
> > The 148 is the hex offset of the capability in configuration space. The
> > Correctable Error mask is located at a hex offset of 0x14 from there. So
> > adding the hex values 0x148 and 0x14 gives us 0x15C. To disable reporting
> > correctable receiver errors you would just want to add a 1 to whatever
> > value you get from "setpci -s 0:1c.5 0x15C.l" and then write that value
> > back. So for example on my system I ended up with something like
> > "setpci -s 0:1c.5 0x15C.l=2001" where the output from the first command
> > was 2000.
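
For anyone who wants to check the arithmetic above, here is a minimal
Python sketch of it. It only computes the register offset and the new
mask value and prints the resulting setpci command; it does not touch
hardware. The function names are hypothetical, the sample line is the one
quoted above, and the sketch sets bit 0 with an OR instead of adding 1,
which is an assumption on my part about the intent.

```python
import re

AER_COR_MASK_OFFSET = 0x14    # Correctable Error Mask, relative to the AER capability
RECEIVER_ERROR_BIT = 1 << 0   # bit 0 of the mask = Receiver Error


def aer_capability_offset(lspci_output):
    """Find the AER capability offset in 'lspci -vvv' text, or return None."""
    match = re.search(
        r"Capabilities: \[([0-9a-fA-F]+)(?: v\d+)?\] Advanced Error Reporting",
        lspci_output,
    )
    return int(match.group(1), 16) if match else None


def masked_value(current_mask):
    """Set the Receiver Error bit and leave every other mask bit untouched."""
    return current_mask | RECEIVER_ERROR_BIT


# Sample values taken from the text above: capability at 0x148, mask read as 2000.
sample = "Capabilities: [148 v1] Advanced Error Reporting"
cap = aer_capability_offset(sample)
reg = cap + AER_COR_MASK_OFFSET
new = masked_value(0x2000)
print(f"setpci -s 0:1c.5 0x{reg:X}.l={new:x}")
```

Using OR means running the computation a second time on a value that
already has bit 0 set gives the same result, whereas literally adding 1
would carry into bit 1.
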
>
> I guess this is the most concrete suggestion for how to avoid the
> issue - perhaps we can do that in rtl8723be driver probe. However, you
> mentioned above that we should only do it if we aren't using the
> wireless card. In this case we are using it... should we look for
> another approach instead?
>
> Thanks
> Daniel
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html