Re: Question about deadlock between AER and pciehp interrupts during resume from S3 with unplugged device

Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> · Thu, 10 Feb 2022 15:47:10 -0500

So the patches indeed resolved the deadlock, but when we try to
hotplug the device back in, the link status check fails:

pcieport 0000:00:01.1: pciehp: Slot(0): Card present
pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
pcieport 0000:00:01.1: pciehp: Failed to check link status

A more detailed log is below. We are still debugging this,
but you might have a quick insight:

Feb 10 23:37:52 amd-BILBY kernel: [   67.885459] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Feb 10 23:37:52 amd-BILBY kernel: [   67.901477] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Feb 10 23:37:52 amd-BILBY kernel: [   67.915376] [drm] kiq ring mec 2 pipe 1 q 0
Feb 10 23:37:52 amd-BILBY kernel: [   67.920041] amdgpu: restore the fine grain parameters
Feb 10 23:37:52 amd-BILBY kernel: [   68.156714] [drm] VCN decode and encode initialized successfully(under SPG Mode).
Feb 10 23:37:52 amd-BILBY kernel: [   68.164222] amdgpu 0000:05:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.171275] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.178932] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.186589] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.194247] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.201906] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.209562] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.217216] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.224872] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.232616] amdgpu 0000:05:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [   68.240272] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [   68.247497] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [   68.249433] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 10 23:37:52 amd-BILBY kernel: [   68.254894] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [   68.261315] ata1.00: supports DRM functions and may not be fully accessible
Feb 10 23:37:52 amd-BILBY kernel: [   68.268558] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [   68.276173] ata1.00: disabling queued TRIM support
Feb 10 23:37:52 amd-BILBY kernel: [   68.283010] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [   68.289443] ata1.00: supports DRM functions and may not be fully accessible
Feb 10 23:37:52 amd-BILBY kernel: [   68.302782] ata1.00: disabling queued TRIM support
Feb 10 23:37:52 amd-BILBY kernel: [   68.308863] ata1.00: configured for UDMA/133
Feb 10 23:37:52 amd-BILBY kernel: [   68.597833] pci 0000:03:00.0: Removing from iommu group 21
Feb 10 23:37:52 amd-BILBY kernel: [   68.597991] acpi LNXPOWER:08: Turning OFF
Feb 10 23:37:52 amd-BILBY kernel: [   68.605244] pci_bus 0000:03: busn_res: [bus 03] is released
Feb 10 23:37:52 amd-BILBY kernel: [   68.611552] acpi LNXPOWER:07: Turning OFF
Feb 10 23:37:52 amd-BILBY kernel: [   68.619469] pci 0000:02:00.0: Removing from iommu group 20
Feb 10 23:37:52 amd-BILBY kernel: [   68.626121] acpi LNXPOWER:04: Turning OFF
Feb 10 23:37:52 amd-BILBY kernel: [   68.632720] pci_bus 0000:02: busn_res: [bus 02-03] is released
Feb 10 23:37:52 amd-BILBY kernel: [   68.638105] OOM killer enabled.
Feb 10 23:37:52 amd-BILBY kernel: [   68.645106] pci 0000:01:00.0: Removing from iommu group 19
Feb 10 23:37:52 amd-BILBY kernel: [   68.649418] Restarting tasks ... done.
Feb 10 23:37:52 amd-BILBY kernel: [   68.662516] PM: suspend exit
Feb 10 23:37:52 amd-BILBY kernel: [   68.669613] rfkill: input handler disabled
Feb 10 23:37:52 amd-BILBY kernel: [   68.695045] show_signal_msg: 28 callbacks suppressed
Feb 10 23:37:52 amd-BILBY kernel: [   68.695048] glmark2[1894]: segfault at 0 ip 00007f799dae6d85 sp 00007ffd34320bc0 error 4 in radeonsi_dri.so[7f799d28d000+a94000]
Feb 10 23:37:52 amd-BILBY kernel: [   68.711653] Code: 00 4c 39 ed 74 6f 49 89 fc eb 1f 66 2e 0f 1f 84 00 00 00 00 00 48 89 ef e8 08 a2 7a ff 49 8b ac 24 e0 77 00 00 4c 39 ed 74 4b <48> 8b 55 00 48 8b 45 08 48 8b 5d 10 48 89 42 08 48 89 10 48 c7 45
Feb 10 23:37:53 amd-BILBY kernel: [   69.684921] pcieport 0000:00:01.1: AER: Root Port link has been reset
Feb 10 23:37:53 amd-BILBY kernel: [   69.691438] pcieport 0000:00:01.1: AER: Device recovery failed
Feb 10 23:37:53 amd-BILBY kernel: [   69.697327] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
Feb 10 23:37:53 amd-BILBY kernel: [   69.706231] pcieport 0000:00:01.1: AER: can't find device of ID0008
Feb 10 23:40:33 amd-BILBY kernel: [  228.769973] sysrq: HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V) show-blocked-tasks(w) dump-ftrace-buffer(z)
Feb 10 23:41:47 amd-BILBY kernel: [  302.759503] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:41:49 amd-BILBY kernel: [  304.795473] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:41:49 amd-BILBY kernel: [  304.803146] pcieport 0000:00:01.1: pciehp: Failed to check link status
Feb 10 23:42:30 amd-BILBY kernel: [  345.767046] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:42:32 amd-BILBY kernel: [  347.811119] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:42:32 amd-BILBY kernel: [  347.818793] pcieport 0000:00:01.1: pciehp: Failed to check link status
Feb 10 23:45:13 amd-BILBY kernel: [  508.465497] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:45:15 amd-BILBY kernel: [  510.505681] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:45:15 amd-BILBY kernel: [  510.513355] pcieport 0000:00:01.1: pciehp: Failed to check link status
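
To make the repeated hotplug failures in the log easier to spot, here is a
minimal Python sketch that pairs each "Card present" message with the
following "Failed to check link status" message on the same port and reports
the time between them. It is not part of any kernel tooling. The function
name link_failures and the regular expression are illustrative assumptions,
and the sketch assumes each log message sits on a single (unwrapped) line.

```python
import re

# Matches "[  302.759503] pcieport 0000:00:01.1: <message>" inside a log line.
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+pcieport\s+(?P<dev>\S+):\s+(?P<msg>.*)"
)


def link_failures(lines):
    """Return (device, card_present_ts, failure_ts) for each failed bring-up."""
    pending = {}   # device -> timestamp of the last "Card present"
    failures = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts, dev, msg = float(m["ts"]), m["dev"], m["msg"]
        if "Card present" in msg:
            pending[dev] = ts
        elif "Failed to check link status" in msg and dev in pending:
            failures.append((dev, pending.pop(dev), ts))
    return failures


# First hotplug attempt from the log above, with wrapped lines joined.
SAMPLE = """\
Feb 10 23:41:47 amd-BILBY kernel: [  302.759503] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:41:49 amd-BILBY kernel: [  304.795473] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:41:49 amd-BILBY kernel: [  304.803146] pcieport 0000:00:01.1: pciehp: Failed to check link status
""".splitlines()

for dev, present, failed in link_failures(SAMPLE):
    print(f"{dev}: link check failed {failed - present:.3f} s after Card present")
```

Feeding the full log through the same function would list all three failed
attempts on 0000:00:01.1.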

Andrey

On 2022-02-10 01:23, Lukas Wunner wrote:
On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
Hi, on a kernel based on 5.4.2 we are observing a deadlock between the
reset_lock semaphore and device_lock (dev->mutex). Our scenario is to
put the system to sleep, disconnect the eGPU from the PCIe bus (either
through a special SBIOS setting or by simply removing power from the
external PCIe cage), and then wake the system up.
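
A deadlock between two locks of this kind is typically a lock-order
inversion. The following userspace Python sketch illustrates the pattern
only; it is not kernel code. It assumes that one path takes reset_lock and
then device_lock while the other path takes them in the opposite order, it
models both locks as plain mutexes even though reset_lock is a semaphore in
the kernel, and the path names "hotplug" and "aer" are hypothetical labels.
Timeouts are used so that the script reports the stall instead of hanging.

```python
import threading

reset_lock = threading.Lock()    # stand-in for pciehp's reset_lock
device_lock = threading.Lock()   # stand-in for device_lock (dev->mutex)


def path(first, second, both_hold_first, both_tried, results, name):
    """Take `first`, then try to take `second` with a timeout."""
    with first:
        both_hold_first.wait()            # both paths now hold their first lock
        got = second.acquire(timeout=0.2)
        results[name] = got
        both_tried.wait()                 # keep `first` held until both have tried
        if got:
            second.release()


def demo():
    results = {}
    both_hold_first = threading.Barrier(2)
    both_tried = threading.Barrier(2)
    threads = [
        # Assumed order for the hotplug path: reset_lock, then device_lock.
        threading.Thread(target=path, args=(reset_lock, device_lock,
                                            both_hold_first, both_tried,
                                            results, "hotplug")),
        # Assumed order for the error-recovery path: device_lock, then reset_lock.
        threading.Thread(target=path, args=(device_lock, reset_lock,
                                            both_hold_first, both_tried,
                                            results, "aer")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Neither path obtains its second lock; without timeouts both would block forever.
    print(demo())
```

With real blocking locks, each path would wait indefinitely for the lock the
other one holds, which is the hang described above.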

I attached the log. Please advise if you have any idea how to work
around it. Since the kernel is old, does anyone know whether this
issue is known and already solved in later kernels?
We cannot try the latest kernel since ours is custom for that platform.

It is a known issue.  Here's a fix I submitted during the v5.9 cycle:

https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/

The fix hasn't been applied yet.  I think I need to rework the patch,
just haven't found the time.

Since AER-handled errors during a system sleep transition are the
trigger in your case, you may also want to consider the following
2-patch series by Kai-Heng Feng, which is currently under discussion:

https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/

That series disables AER during a system sleep transition and
should thus prevent the flood of AER-handled errors you're seeing.
Once AER is disabled, the reset-induced deadlocks should go away as well.

Thanks,

Lukas