Re: pci_bus_read_config constantly took 1.3 seconds

Keith Busch <kbusch@xxxxxxxxxx> · Thu, 28 Nov 2019 08:17:52 +0900

On Tue, Nov 26, 2019 at 04:46:25PM -0800, Kexin Chen wrote:
> I'm Kexin. I'm working on a Linux NVMe system. Some of my tests triggered
> PCI AER uncorrectable errors leading to slow pci_bus_read_config_XXX,
> which took 1.3 seconds for every access. This caused a lot of CPU
> scheduling issues, for example, 'Thread not rescheduled for xxx ms
> after irq xxx' or 'Softirq x took xxx ms', and finally a kernel reboot
> due to soft lockup. There's definitely a hardware issue, but could the
> kernel take some action to avoid crashing and exit from this
> gracefully? My current system is using 4.4.182.

Unless the PCI layer is reading config space that it should know not to
access, there isn't anything the kernel can do here if we're waiting on
the hardware to complete the transaction. The hardware just has to
function correctly.
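
If it helps to confirm that the time is spent waiting on the device, a
rough user-space check is to time a single config read through sysfs.
This is only a sketch: the helper names time_config_read() and
looks_unresponsive() are made up for illustration, and it assumes the
device's config space is exposed at /sys/bus/pci/devices/<BDF>/config
and that a read of that file reaches the hardware on your kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the first 4 bytes (vendor/device ID) of a
 * config space file and report how long the read took in milliseconds.
 * Returns 0 on success, -1 if the file cannot be opened or read.
 */
int time_config_read(const char *path, uint32_t *value, double *elapsed_ms)
{
    struct timespec start, end;
    uint32_t v = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    /* Time only the read itself, not the open/close. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    n = pread(fd, &v, sizeof(v), 0);
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);

    if (n != (ssize_t)sizeof(v))
        return -1;

    *value = v;
    *elapsed_ms = (end.tv_sec - start.tv_sec) * 1e3 +
                  (end.tv_nsec - start.tv_nsec) / 1e6;
    return 0;
}

/*
 * Hypothetical helper: an all-ones value is what a config read
 * typically returns when the device did not respond.
 */
int looks_unresponsive(uint32_t value)
{
    return value == 0xFFFFFFFFu;
}
```

Pointing time_config_read() at the config file of the failing device
and of a healthy one should show whether the 1.3 seconds is specific to
that device. An all-ones result alongside a long delay would suggest
the transaction is timing out in hardware.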

There are some types of AER errors that indicate the kernel can avoid
accessing some config space, and that handling has improved since 4.4.
For example, we don't try reading upstream ports that are the source of
an ERR_FATAL because the link can't be considered reliable. You may want
to try a more recent stable kernel to see if any of those improvements
apply to your case.