On Wed, Jan 06, 2021 at 09:28:23PM +0100, Samuel Thibault wrote:
> Samuel Thibault wrote on Mon, 04 Jan 2021 22:36:48 +0100:
> > Samuel Thibault wrote on Mon, 04 Jan 2021 21:12:47 +0100:
> > > Vidya Sagar wrote:
> > > > Since this is a laptop, I'm suspecting that ASPM states might have
> > > > been enabled which could be causing these errors.
> > >
> > > Keith Busch wrote on Mon, 04 Jan 2021 10:44:35 -0800:
> > > > Sometimes these types of errors occur from low power settings, so you
> > > > can try disabling the automatic management of these (assuming the
> > > > hardware supports it). To disable nvme specific power state transitions,
> > > > the kernel parameter is "nvme_core.default_ps_max_latency_us=0".
> > >
> > > I have tried to add it,
> > >
> > > I'll watch in the coming
> > > hours/days to see if that avoided the issue.
> >
> > I did get one
> >
> > Jan 4 22:34:53 begin kernel: [ 7165.207562] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0
> > Jan 4 22:34:53 begin kernel: [ 7165.213891] nvme 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> > Jan 4 22:34:53 begin kernel: [ 7165.216949] nvme 0000:02:00.0: device [15b7:5006] error status/mask=00000001/0000e000
> > Jan 4 22:34:53 begin kernel: [ 7165.219995] nvme 0000:02:00.0: [ 0] RxErr
> >
> > > > PCI also has automatic link power savings that you can disable with
> > > > parameter "pcie_aspm=off".
> > >
> > > I'll try that if I still see errors with the nvme_core parameter.
> >
> > I'm on it.
>
> I tried to make the machine only run apt-get update every 10m for 24h.
>
> With pcie_aspm=off, I didn't get any corrected error
> Without it I got 39 corrected errors
>
> So that seems very relevant :)
>
> Is there more I can provide to investigate if that can somehow be fixed
> in the driver? I guess I can safely use the system with pcie_aspm=off?
> (the energy saving seems negligible)

I don't think there's more to do from the kernel or driver beyond
disabling usage of the problematic feature. I think a proper fix would
have to come from the hardware vendor.
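
If you want to repeat the with/without comparison later (say, after a
firmware update), something like the rough sketch below could tally the
corrected errors per device from a saved kernel log. It only matches the
"AER: Corrected error received" line format quoted above. The helper
names count_corrected_errors and aspm_disabled are made up for
illustration and are not part of any kernel tooling.

```python
import re
from collections import Counter

# Matches the root-port notification line quoted in this thread, e.g.
# "pcieport 0000:00:1d.0: AER: Corrected error received: 0000:02:00.0".
# The capture group is the PCI address of the device that reported the error.
AER_CORRECTED = re.compile(
    r"AER: Corrected error received: "
    r"(?P<dev>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
)


def count_corrected_errors(lines):
    """Count corrected AER notifications per reporting device.

    `lines` is any iterable of kernel log lines (hypothetical helper).
    """
    counts = Counter()
    for line in lines:
        match = AER_CORRECTED.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts


def aspm_disabled(cmdline):
    """Return True if pcie_aspm=off appears as a token in a kernel command line.

    `cmdline` is the text of the command line, e.g. the contents of
    /proc/cmdline (hypothetical helper).
    """
    return "pcie_aspm=off" in cmdline.split()
```

Feeding it the lines of a log covering each 24h run, plus the contents of
/proc/cmdline for that boot, should give numbers comparable to the 0 vs 39
reported above, assuming the log format matches.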