Hi,

On 6/14/22 11:07 AM, Andrey Grodzovsky wrote:
> Just a gentle ping, also - I updated the ticket https://bugzilla.kernel.org/show_bug.cgi?id=215590
>
> with the workaround we did if this could help you to advise us
> what would be a generic solution for this ?
>
> Andrey

Can you explain your workaround? It seems to be unrelated to the
deadlock issue discussed in this thread. Are they related?

>
> On 2022-06-10 17:25, Andrey Grodzovsky wrote:
>>
>>
>> On 2022-02-10 09:39, Andrey Grodzovsky wrote:
>>> Thanks a lot for quick response, we will give this a try.
>>>
>>> Andrey
>>>
>>> On 2022-02-10 01:23, Lukas Wunner wrote:
>>>> On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
>>>>> Hi, on kernel based on 5.4.2 we are observing a deadlock between
>>>>> reset_lock semaphore and device_lock (dev->mutex). The scenario
>>>>> we do is putting the system to sleep, disconnecting the eGPU
>>>>> from the PCIe bus (through a special SBIOS setting) or by simply
>>>>> removing power to external PCIe cage and waking the
>>>>> system up.
>>>>>
>>>>> I attached the log. Please advise if you have any idea how
>>>>> to work around it ? Since the kernel is old, does anyone
>>>>> have an idea if this issue is known and already solved in later kernels ?
>>>>> We cannot try with latest since our kernel is custom for that platform.
>>>>
>>>> It is a known issue. Here's a fix I submitted during the v5.9 cycle:
>>>>
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-pci%2F908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas%40wunner.de%2F&data=04%7C01%7Candrey.grodzovsky%40amd.com%7Cba698967471548d739c108d9ec5dcf6c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637800710411446272%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&sdata=hrRVL77%2FNRvojfG2WDamDLO5dsqn3Cv6XxNbP0eGum0%3D&reserved=0
>>>>
>>>> The fix hasn't been applied yet.
>>>> I think I need to rework the patch,
>>>> just haven't found the time.
>>
>> Hey Lukas - just checking again if you had a chance to push this change
>> through ? It's essential to us in one of our customer projects so we
>> wonder if you have any estimate when will it be up-streamed and if we can
>> help with this. We would also need backporting this back to 5.11 and 5.4
>> kernels after it's upstreamed.
>>
>> Another point I want to mention is that this patch has a negative
>> side effect on plug back times - it causes a regression point for the
>> delay to light-up display at resume time related to back-ported AER
>>
>> Anatoli is working on resolving this and so maybe he can add his
>> comment here and maybe you can help him with proper resolution for this.
>>
>> Andrey
>>
>>>>
>>>> Since the trigger in your case are AER-handled errors during a
>>>> system sleep transition, you may also want to consider the
>>>> following 2-patch series by Kai-Heng Feng which is currently
>>>> under discussion:
>>>>
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-pci%2F20220127025418.1989642-1-kai.heng.feng%40canonical.com%2F&data=04%7C01%7Candrey.grodzovsky%40amd.com%7Cba698967471548d739c108d9ec5dcf6c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637800710411446272%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&sdata=tnLUa6J%2FLqFrlm4CfZ9l26io0bOQ7ip30d26ax05st4%3D&reserved=0
>>>>
>>>> That series disables AER during a system sleep transition and
>>>> should thus prevent the flood of AER-handled errors you're seeing.
>>>> Once AER is disabled, the reset-induced deadlocks should go away as well.
>>>>
>>>> Thanks,
>>>>
>>>> Lukas

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer