On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
> Hi, on kernel based on 5.4.2 we are observing a deadlock between
> reset_lock semaphore and device_lock (dev->mutex). The scenario
> we do is putting the system to sleep, disconnecting the eGPU
> from the PCIe bus (through a special SBIOS setting) or by simply
> removing power to external PCIe cage and waking the
> system up.
>
> I attached the log. Please advise if you have any idea how
> to work around it ? Since the kernel is old, does anyone
> have an idea if this issue is known and already solved in later kernels ?
> We cannot try with latest since our kernel is custom for that platform.

It is a known issue. Here's a fix I submitted during the v5.9 cycle:

https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@xxxxxxxxx/

The fix hasn't been applied yet. I think I need to rework the patch,
just haven't found the time.

Since the trigger in your case are AER-handled errors during a
system sleep transition, you may also want to consider the following
2-patch series by Kai-Heng Feng which is currently under discussion:

https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@xxxxxxxxxxxxx/

That series disables AER during a system sleep transition and should
thus prevent the flood of AER-handled errors you're seeing. Once AER
is disabled, the reset-induced deadlocks should go away as well.

Thanks,

Lukas