Re: Question about deadlock between AER and pceihp interrupts during resume from S3 with unplugged device

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 2022-02-10 09:39, Andrey Grodzovsky wrote:
Thanks a lot for quick response, we will give this a try.

Andrey

On 2022-02-10 01:23, Lukas Wunner wrote:
On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
Hi, on kernel based on 5.4.2 we are observing a deadlock between
reset_lock semaphore and device_lock (dev->mutex). The scenario
we do is putting the system to sleep, disconnecting the eGPU
from the PCIe bus (through a special SBIOS setting) or by simply
removing power to external PCIe cage and waking the
system up.

I attached the log. Please advise if you have any idea how
to work around it ? Since the kernel is old, does anyone
have an idea if this issue is known and already solved in later kernels ?
We cannot try with latest since our kernel is custom for that platform.

It is a known issue.  Here's a fix I submitted during the v5.9 cycle:

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-pci%2F908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas%40wunner.de%2F&data=04%7C01%7Candrey.grodzovsky%40amd.com%7Cba698967471548d739c108d9ec5dcf6c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637800710411446272%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&sdata=hrRVL77%2FNRvojfG2WDamDLO5dsqn3Cv6XxNbP0eGum0%3D&reserved=0

The fix hasn't been applied yet.  I think I need to rework the patch,
just haven't found the time.

Hey Lucas - just checking again if you had a chance to push this change
through ? It's essential to us in one of our costumer projects so we
wonder if have any estimate when will it be up-streamed and if we can
help with this. We would also need backporting this back to 5.11 and 5.4
kernels after it's upstreamed.

Another point I want to mention is that this patch has a negative
side effect on plug back times - it causes a regression point for the delay to light-up display at resume time related to back-ported AER

Anatoli is working on resolving this and so maybe he can add his
comment here and maybe you can help him with proper resolution for this.

Andrey


Since the trigger in your case are AER-handled errors during a
system sleep transition, you may also want to consider the
following 2-patch series by Kai-Heng Feng which is currently
under discussion:

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Flinux-pci%2F20220127025418.1989642-1-kai.heng.feng%40canonical.com%2F&data=04%7C01%7Candrey.grodzovsky%40amd.com%7Cba698967471548d739c108d9ec5dcf6c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637800710411446272%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C2000&sdata=tnLUa6J%2FLqFrlm4CfZ9l26io0bOQ7ip30d26ax05st4%3D&reserved=0

That series disables AER during a system sleep transition and
should thus prevent the flood of AER-handled errors you're seeing.
Once AER is disabled, the reset-induced deadlocks should go away as well.

Thanks,

Lukas



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux