On Tue, Nov 26, 2019 at 04:46:25PM -0800, Kexin Chen wrote:
> I'm Kexin. I'm working on Linux nvme system. Some of my test triggered
> PCI AER uncorrectable errors leading to slow pci_bus_read_config_XXX,
> which took 1.3 seconds for every access. This caused a lot of CPU
> scheduling issues, for example, 'Thread not rescheduled for xxx ms
> after irq xxx' or 'Softirq x took xxx ms', and finally kernel reboot
> due to soft lockup. Definitely there's hardware issue, but could
> kernel take some actions to avoid kernel from crashing and exit this
> gracefully ? My current system is using 4.4.182.

Unless the PCI layer is reading some config space that it really should
know not to access, there isn't anything the kernel can do here if
we're really waiting on hardware to complete the transaction. The
hardware just has to function correctly.

There are some types of AERs that do indicate the kernel may avoid
accessing some config space, and that handling has been improved since
4.4. For example, we don't try reading upstream ports that are the
source of an ERR_FATAL because the link can't be considered reliable.

You may want to try a more recent stable kernel to see if any of those
improvements apply to your case.
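
To make the ERR_FATAL case a bit more concrete, here is a rough model of
that decision in Python. It is not kernel code and does not reproduce
the kernel's actual logic: the names PortType, Severity and
may_read_error_source are made up for this sketch, and treating
endpoints the same way as upstream ports is an assumption of the sketch,
not something stated above.

```python
from enum import Enum


class Severity(Enum):
    CORRECTABLE = "correctable"
    NONFATAL = "nonfatal"
    FATAL = "fatal"


class PortType(Enum):
    ROOT_PORT = "root port"
    DOWNSTREAM_PORT = "downstream port"
    UPSTREAM_PORT = "upstream port"
    ENDPOINT = "endpoint"


# Hypothetical grouping: these devices are reached from the root without
# crossing the link below them, so a fatal error on that link does not
# make their own config space unreachable.
REACHABLE_AFTER_FATAL = {PortType.ROOT_PORT, PortType.DOWNSTREAM_PORT}


def may_read_error_source(port_type: PortType, severity: Severity) -> bool:
    """Return True if reading the error source's config space looks safe.

    Assumption: after ERR_FATAL the link above an upstream port (or an
    endpoint) cannot be trusted, so reads through it are skipped instead
    of waiting on hardware that may never answer.
    """
    if severity is not Severity.FATAL:
        # Correctable and non-fatal errors leave the link usable.
        return True
    return port_type in REACHABLE_AFTER_FATAL


if __name__ == "__main__":
    for port in PortType:
        for sev in Severity:
            verdict = "read" if may_read_error_source(port, sev) else "skip"
            print(f"{port.value:16} {sev.value:12} {verdict}")
```

The point is only that the skip applies when the error is fatal and the
device sits at the upstream end of the affected link. A slow or hung
read that is not accompanied by such an error still has to wait on the
hardware.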