On 2022-06-14 14:22, Sathyanarayanan Kuppuswamy wrote:
Hi,
On 6/14/22 11:07 AM, Andrey Grodzovsky wrote:
Just a gentle ping. I also updated the ticket
https://bugzilla.kernel.org/show_bug.cgi?id=215590
with the workaround we applied, in case it helps you advise us
on what a generic solution for this would be.
Andrey
Can you explain your workaround (WA)? It seems to be unrelated to the
deadlock issue discussed in this thread. Are they related?
So from the start: originally we have an extension PCI board which is
hot-pluggable into our system board. On top of this extension board we
have an AMD dGPU card. Originally we observed a hang on resume from
sleep (S3) in an AER-enabled system because of a race between AER and
pciehp on S3 resume, and this was resolved by the patch
https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/T/
Now, after this, we are facing a second issue: after resume, and after
AER driver recovery has completed for pcieport, the system won't detect
a new hotplug of the extension board into the system board. Anatoli
looked into it and found the workaround that I attached, which made it
work by resetting the secondary bus and updating the link speed on the
upstream bridge after AER recovery completes (post S3 resume). But this
is just a workaround and not a generic solution, so we would like to get
advice on a generic fix for this problem.
To reiterate, the full scenario is like this:
1) Boot the system.
2) The extension board is hotplugged for the first time and the dGPU is
added to the PCI topology.
3) System suspend (S3).
4) We have a custom BIOS which 'shuts off' the extension board during
sleep, so on resume the system discovers that the extension board (and
dGPU) are gone and hot-removes them from the PCI topology. Together with
this hot remove, AER errors are generated and handled.
5) We again try to hot plug through a script we have, but the system
won't detect the new hot plug of the extension board.
5*) The given workaround patch fixes the issue in bullet 5): the hot
plug is detected, and the system recognizes the extension board and adds
it and the dGPU to the PCI topology.
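For anyone trying to reproduce this, steps 3) and 5) above could be
driven from userspace roughly as follows. This is only a sketch under
assumptions: the script mentioned in step 5) is not shown in this
thread, so a plain PCI bus rescan through sysfs is used as a stand-in,
and the helper names (sysfs_write, reproduce) are made up for
illustration. The two sysfs files are standard kernel interfaces and
writing to them needs root, so the sketch defaults to a dry run that
only lists what it would do.

```python
from pathlib import Path

# Standard sysfs control files; writing to them requires root.
SYSFS_SUSPEND = Path("/sys/power/state")
SYSFS_PCI_RESCAN = Path("/sys/bus/pci/rescan")


def sysfs_write(path, value, dry_run=True):
    """Write value to a sysfs file, or only describe the write in dry-run mode."""
    if not dry_run:
        path.write_text(value)
    return f"echo {value} > {path}"


def reproduce(dry_run=True):
    """Drive steps 3) and 5); steps 1), 2) and 4) are manual or done by the BIOS."""
    actions = []
    # Step 3: suspend to RAM (this is S3 only if mem_sleep is set to "deep").
    actions.append(sysfs_write(SYSFS_SUSPEND, "mem", dry_run))
    # Step 4 happens during resume: the board is gone and gets hot-removed.
    # Step 5 (assumed stand-in for the real script): rescan all PCI buses.
    actions.append(sysfs_write(SYSFS_PCI_RESCAN, "1", dry_run))
    return actions


if __name__ == "__main__":
    for line in reproduce(dry_run=True):
        print(line)
```

With dry_run=False the suspend write only returns after the system
has resumed, so the rescan then runs post-resume, which is the point at
which the failure in step 5) is observed.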
Andrey
On 2022-06-10 17:25, Andrey Grodzovsky wrote:
On 2022-02-10 09:39, Andrey Grodzovsky wrote:
Thanks a lot for quick response, we will give this a try.
Andrey
On 2022-02-10 01:23, Lukas Wunner wrote:
On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
Hi, on a kernel based on 5.4.2 we are observing a deadlock between the
reset_lock semaphore and device_lock (dev->mutex). Our scenario is
putting the system to sleep, disconnecting the eGPU from the PCIe bus
(through a special SBIOS setting, or by simply removing power to the
external PCIe cage), and waking the system up.
I attached the log. Please advise if you have any idea how
to work around it. Since the kernel is old, does anyone know
whether this issue is known and already solved in later kernels?
We cannot try with the latest since our kernel is custom for that platform.
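To make the shape of the problem concrete, here is a small userspace
model of a lock-order inversion between two locks. It is not kernel
code and makes assumptions: the two kernel locks are modelled as plain
Python locks (reset_lock is really a semaphore), and which code path
takes which lock first is assumed for illustration; the report above
only names the two locks involved. The path names hotplug_path and
reset_path are hypothetical. Timeouts are used so the demo reports the
stall instead of hanging.

```python
import threading


def demo_lock_inversion(timeout=0.2):
    """Model two paths that take the same two locks in opposite order."""
    reset_lock = threading.Lock()    # stand-in for pciehp's reset_lock
    device_lock = threading.Lock()   # stand-in for dev->mutex
    both_hold_first = threading.Barrier(2)
    both_tried_second = threading.Barrier(2)
    got_second = {}

    def path(name, first, second):
        with first:
            # Make sure each path holds its first lock before going on.
            both_hold_first.wait()
            # In the kernel this would block forever; here it times out.
            acquired = second.acquire(timeout=timeout)
            got_second[name] = acquired
            # Keep the first lock held until both attempts have finished.
            both_tried_second.wait()
            if acquired:
                second.release()

    threads = [
        threading.Thread(target=path,
                         args=("hotplug_path", reset_lock, device_lock)),
        threading.Thread(target=path,
                         args=("reset_path", device_lock, reset_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second


if __name__ == "__main__":
    print(demo_lock_inversion())
```

Neither path ever gets its second lock, because each one is waiting
for the lock the other already holds. Without the timeouts both would
wait forever, which is the hang being reported.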
It is a known issue. Here's a fix I submitted during the v5.9 cycle:
https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/
The fix hasn't been applied yet. I think I need to rework the patch,
just haven't found the time.
Hey Lukas, just checking again if you had a chance to push this change
through? It's essential to us in one of our customer projects, so we
wonder if you have any estimate of when it will be upstreamed and if we
can help with this. We would also need it backported to the 5.11 and 5.4
kernels after it's upstreamed.
Another point I want to mention is that this patch has a negative
side effect on plug-back times: it regresses the delay to light up the
display at resume time, which is related to back-ported AER.
Anatoli is working on resolving this, so maybe he can add his
comment here and maybe you can help him with a proper resolution for this.
Andrey
Since the trigger in your case are AER-handled errors during a
system sleep transition, you may also want to consider the
following 2-patch series by Kai-Heng Feng which is currently
under discussion:
https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/
That series disables AER during a system sleep transition and
should thus prevent the flood of AER-handled errors you're seeing.
Once AER is disabled, the reset-induced deadlocks should go away as well.
Thanks,
Lukas