pci_bus_read_config constantly took 1.3 seconds

Kexin Chen <kexinchen7@xxxxxxxxx> · Tue, 26 Nov 2019 16:46:25 -0800

Hi,

I'm Kexin. I'm working on a Linux NVMe system. Some of my tests
triggered PCI AER uncorrectable errors, after which every
pci_bus_read_config_XXX access took 1.3 seconds. This caused a lot of
CPU scheduling problems, for example 'Thread not rescheduled for xxx ms
after irq xxx' or 'Softirq x took xxx ms', and finally the kernel
rebooted because of a soft lockup. There is definitely a hardware
issue, but could the kernel take some action to avoid crashing and
handle this gracefully? My system is currently running 4.4.182.
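
For anyone trying to reproduce the slow accesses, the following is a
minimal userspace sketch of one way to time a single config-space read.
It is not the code from the tests described above. It assumes the
device exposes the usual sysfs file
/sys/bus/pci/devices/<domain:bus:dev.fn>/config, and the helper name
timed_config_read is hypothetical. A read through sysfs may not follow
exactly the same path as in-kernel pci_bus_read_config_XXX callers, so
treat the number as a rough indication only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a kernel or library API): read `len` bytes
 * at `offset` from a config-space file such as
 * /sys/bus/pci/devices/<domain:bus:dev.fn>/config and report how long
 * the read itself took, in milliseconds.
 *
 * Returns 0 on success, -1 if the file cannot be opened or the read
 * is short or fails.
 */
int timed_config_read(const char *path, off_t offset,
                      void *buf, size_t len, double *elapsed_ms)
{
    struct timespec start, end;
    ssize_t n;
    int fd;

    assert(path != NULL && buf != NULL && elapsed_ms != NULL && len > 0);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Time only the read, not the open/close. */
    clock_gettime(CLOCK_MONOTONIC, &start);
    n = pread(fd, buf, len, offset);
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);

    if (n != (ssize_t)len)
        return -1;

    *elapsed_ms = (end.tv_sec - start.tv_sec) * 1000.0 +
                  (end.tv_nsec - start.tv_nsec) / 1e6;
    return 0;
}
```

Reading two bytes at offset 0 returns the vendor ID. On a healthy
device the elapsed time should be far below a millisecond, so a value
near 1300 ms would match the behaviour reported here.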

Thanks,
Kexin