Thanks a lot for quick response, we will give this a try.
Andrey
On 2022-02-10 01:23, Lukas Wunner wrote:
On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
Hi, on kernel based on 5.4.2 we are observing a deadlock between
reset_lock semaphore and device_lock (dev->mutex). The scenario
we do is putting the system to sleep, disconnecting the eGPU
from the PCIe bus (through a special SBIOS setting) or by simply
removing power to external PCIe cage and waking the
system up.
I attached the log. Please advise if you have any idea how
to work around it ? Since the kernel is old, does anyone
have an idea if this issue is known and already solved in later kernels ?
We cannot try with latest since our kernel is custom for that platform.
It is a known issue. Here's a fix I submitted during the v5.9 cycle:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-pci%2F908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas%40wunner.de%2F&data=04%7C01%7Candrey.grodzovsky%40amd.com%7Cba698967471548d739c108d9ec5dcf6c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637800710411446272%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&sdata=hrRVL77%2FNRvojfG2WDamDLO5dsqn3Cv6XxNbP0eGum0%3D&reserved=0
The fix hasn't been applied yet. I think I need to rework the patch,
just haven't found the time.
Since the trigger in your case are AER-handled errors during a
system sleep transition, you may also want to consider the
following 2-patch series by Kai-Heng Feng which is currently
under discussion:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-pci%2F20220127025418.1989642-1-kai.heng.feng%40canonical.com%2F&data=04%7C01%7Candrey.grodzovsky%40amd.com%7Cba698967471548d739c108d9ec5dcf6c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637800710411446272%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&sdata=tnLUa6J%2FLqFrlm4CfZ9l26io0bOQ7ip30d26ax05st4%3D&reserved=0
That series disables AER during a system sleep transition and
should thus prevent the flood of AER-handled errors you're seeing.
Once AER is disabled, the reset-induced deadlocks should go away as well.
Thanks,
Lukas