Re: Question about deadlock between AER and pciehp interrupts during resume from S3 with unplugged device

Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> · Wed, 15 Jun 2022 11:49:31 -0400

On 2022-06-15 11:14, Sathyanarayanan Kuppuswamy wrote:

On 6/14/22 1:35 PM, Andrey Grodzovsky wrote:

On 2022-06-14 14:22, Sathyanarayanan Kuppuswamy wrote:
Hi,

On 6/14/22 11:07 AM, Andrey Grodzovsky wrote:
Just a gentle ping. Also, I updated the ticket https://bugzilla.kernel.org/show_bug.cgi?id=215590
with the workaround we did, in case this helps you advise us
on what a generic solution for this would be.

Andrey
Can you explain your WA? It seems to be unrelated to the deadlock issue
discussed in this thread. Are they related?

So, from the start: we have an extension PCI board which is hot-pluggable into our system board. On top of this extension board we have
an AMD dGPU card. Originally we observed a hang on resume from sleep (S3) in an
AER-enabled system because of a race between AER and pciehp on S3 resume, and this
was resolved by the patch https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/T/

There is a patch to disable AER in the suspend/resume path (from Kai-Heng Feng). Did
you check with this patch?

Yes, this patchset [1] had no impact on the problem, and we only included
the AB-BA deadlock patch by Lukas in our code since it resolved the SW
deadlock for us.

[1] https://patchwork.kernel.org/project/linux-pci/patch/20220126071853.1940111-1-kai.heng.feng@xxxxxxxxxxxxx/

Now, after this, we are facing a second issue: after resume, and after
AER driver recovery has completed for pcieport, the system won't detect a new
hotplug of the extension board into the system board. Anatoli looked

What about the hotplug events during this sequence? Did you get the
LINK DOWN/UP or Presence change events?

I think we do get them, both in the first-time hot plug (step 2
below) and in the post-S3-resume hot plug (step 5 below). It's just
that we seem to get a timeout from pcie_wait_for_link in step 5.

Step 2) logs
Feb 10 23:36:59 amd-BILBY kernel: [   28.729523] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:36:59 amd-BILBY kernel: [   28.735552] pcieport 0000:00:01.1: pciehp: Slot(0): Link Up

Step 5) logs
Feb 10 23:41:47 amd-BILBY kernel: [  302.759503] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:41:49 amd-BILBY kernel: [  304.795473] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:41:49 amd-BILBY kernel: [  304.803146] pcieport 0000:00:01.1: pciehp: Failed to check link status

But maybe you meant something else, and if so, can you
tell me what exactly you want me to look at?

Andrey
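To compare the step 2 and step 5 logs mechanically, a small parser along the following lines might help. This is only a sketch: the message strings are copied from the logs quoted in this thread, while the event names (`card_present`, `link_timeout`, and so on) and the helper `slot_came_up` are made up for illustration and are not kernel identifiers.

```python
import re
from typing import List, NamedTuple, Optional

# Matches "[  302.759503] pcieport 0000:00:01.1: <message>" inside a syslog line.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+pcieport\s+"
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+(?P<msg>.*)$"
)


class Event(NamedTuple):
    timestamp: float  # kernel timestamp in seconds
    device: str       # PCI address of the port
    kind: str         # illustrative event name, see classify()


def classify(msg: str) -> Optional[str]:
    """Map a message from the thread's logs to an illustrative event name."""
    if "Card present" in msg:
        return "card_present"
    if "Link Up" in msg:
        return "link_up"
    if "Data Link Layer Link Active not set" in msg:
        return "link_timeout"
    if "Failed to check link status" in msg:
        return "link_check_failed"
    return None


def parse(log: str) -> List[Event]:
    """Extract the recognised pcieport events from a block of log text."""
    events = []
    for line in log.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue
        kind = classify(match.group("msg"))
        if kind is not None:
            events.append(Event(float(match.group("ts")), match.group("dev"), kind))
    return events


def slot_came_up(events: List[Event]) -> bool:
    """True if a Link Up was seen and no link timeout was reported."""
    kinds = [e.kind for e in events]
    return "link_up" in kinds and "link_timeout" not in kinds


STEP2_LOG = """\
Feb 10 23:36:59 amd-BILBY kernel: [   28.729523] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:36:59 amd-BILBY kernel: [   28.735552] pcieport 0000:00:01.1: pciehp: Slot(0): Link Up
"""

STEP5_LOG = """\
Feb 10 23:41:47 amd-BILBY kernel: [  302.759503] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:41:49 amd-BILBY kernel: [  304.795473] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:41:49 amd-BILBY kernel: [  304.803146] pcieport 0000:00:01.1: pciehp: Failed to check link status
"""

if __name__ == "__main__":
    for name, log in (("step 2", STEP2_LOG), ("step 5", STEP5_LOG)):
        events = parse(log)
        print(name, [e.kind for e in events], "came up:", slot_came_up(events))
```

Run on the two excerpts, this separates the successful first hot plug from the post-resume attempt, where the presence event arrives but the link never becomes active.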

into it and found the workaround that I attached, which made it work by
resetting the secondary bus and updating the link speed on the upstream bridge
after AER recovery completes (post S3 resume). But this is just a
workaround and not a generic solution, so we would like to get advice on a generic fix for this problem.

To reiterate, the full scenario is as follows:

1) Boot the system.

2) The extension board is hot plugged for the first time and the dGPU is added to the PCI topology.

3) System suspend (S3).

4) We have a custom BIOS which 'shuts off' the extension board during sleep, so on resume the system discovers that the extension board (and dGPU) are gone and hot removes them from the PCI topology. Together with this hot remove, AER errors are generated and handled.

5) We again try to hot plug through a script we have, but the system won't
detect the new hot plug of the extension board.

5*) The given workaround patch fixes the issue in step 5): the hot plug
is detected, and the system recognizes the extension board and adds it and the dGPU to the PCI topology.

Andrey

On 2022-06-10 17:25, Andrey Grodzovsky wrote:

On 2022-02-10 09:39, Andrey Grodzovsky wrote:
Thanks a lot for quick response, we will give this a try.

Andrey

On 2022-02-10 01:23, Lukas Wunner wrote:
On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
Hi, on a kernel based on 5.4.2 we are observing a deadlock between the
reset_lock semaphore and device_lock (dev->mutex). Our scenario
is putting the system to sleep, disconnecting the eGPU
from the PCIe bus (either through a special SBIOS setting or by simply
removing power to the external PCIe cage), and waking the
system up.
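For readers unfamiliar with the term, an AB-BA deadlock arises when two code paths take the same pair of locks in opposite orders. The sketch below is a toy, user-space illustration of detecting such an inversion from recorded acquisition orders. The lock names `reset_lock` and `device_lock` come from this thread; the path names and the order assigned to each path are assumptions made for illustration only, and the code does not describe the patch discussed here or any kernel implementation.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_order_graph(paths: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Build edges A -> B meaning: some path acquires B while holding A."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for locks in paths.values():
        for i, held in enumerate(locks):
            for later in locks[i + 1:]:
                graph[held].add(later)
    return graph


def find_inversions(paths: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return lock pairs that are taken in both orders by different paths."""
    graph = build_order_graph(paths)
    inversions = []
    for a in sorted(graph):
        for b in sorted(graph[a]):
            # Report each unordered pair once (a < b) if the reverse edge exists.
            if a < b and a in graph.get(b, ()):
                inversions.append((a, b))
    return inversions


# Hypothetical acquisition orders; which path takes which lock first is an
# assumption for this example, not something stated in the thread.
INVERTED = {
    "hotplug_path": ["reset_lock", "device_lock"],
    "error_recovery_path": ["device_lock", "reset_lock"],
}

# The same two paths with one consistent order: no inversion remains.
CONSISTENT = {
    "hotplug_path": ["device_lock", "reset_lock"],
    "error_recovery_path": ["device_lock", "reset_lock"],
}

if __name__ == "__main__":
    print("inverted:", find_inversions(INVERTED))
    print("consistent:", find_inversions(CONSISTENT))
```

Taking the locks in one consistent order is one general remedy for this class of bug; whether that is practical in the hotplug and error-recovery paths is a separate question for the patch under discussion.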

I attached the log. Please advise if you have any idea how
to work around it. Since the kernel is old, does anyone
know whether this issue is known and already solved in later kernels?
We cannot try the latest kernel since ours is custom for that platform.

It is a known issue.  Here's a fix I submitted during the v5.9 cycle:

https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/

The fix hasn't been applied yet.  I think I need to rework the patch,
just haven't found the time.

Hey Lukas - just checking again whether you had a chance to push this change
through? It's essential to us in one of our customer projects, so we
wonder if you have any estimate of when it will be upstreamed and whether we can
help with this. We would also need it backported to the 5.11 and 5.4
kernels after it's upstreamed.

Another point I want to mention is that this patch has a negative
side effect on plug-back times: it causes a regression in the delay to light up the display at resume time, related to back-ported AER

Anatoli is working on resolving this and so maybe he can add his
comment here and maybe you can help him with proper resolution for this.

Andrey

Since the trigger in your case are AER-handled errors during a
system sleep transition, you may also want to consider the
following 2-patch series by Kai-Heng Feng which is currently
under discussion:

https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/

That series disables AER during a system sleep transition and
should thus prevent the flood of AER-handled errors you're seeing.
Once AER is disabled, the reset-induced deadlocks should go away as well.

Thanks,

Lukas