Re: Question about deadlock between AER and pciehp interrupts during resume from S3 with unplugged device

So the patches did indeed help resolve the deadlock, but when we try
to hotplug the device back in, there is a link status failure:

pcieport 0000:00:01.1: pciehp: Slot(0): Card present
pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
pcieport 0000:00:01.1: pciehp: Failed to check link status

A more detailed log is below.
We are still debugging, but again, you might have a quick insight.

Feb 10 23:37:52 amd-BILBY kernel: [ 67.885459] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Feb 10 23:37:52 amd-BILBY kernel: [ 67.901477] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Feb 10 23:37:52 amd-BILBY kernel: [ 67.915376] [drm] kiq ring mec 2 pipe 1 q 0
Feb 10 23:37:52 amd-BILBY kernel: [ 67.920041] amdgpu: restore the fine grain parameters
Feb 10 23:37:52 amd-BILBY kernel: [ 68.156714] [drm] VCN decode and encode initialized successfully(under SPG Mode).
Feb 10 23:37:52 amd-BILBY kernel: [ 68.164222] amdgpu 0000:05:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [ 68.171275] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [ 68.178932] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [ 68.186589] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [ 68.194247] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [ 68.201906] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [ 68.209562] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [ 68.217216] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [ 68.224872] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [ 68.232616] amdgpu 0000:05:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
Feb 10 23:37:52 amd-BILBY kernel: [ 68.240272] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [ 68.247497] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [ 68.249433] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 10 23:37:52 amd-BILBY kernel: [ 68.254894] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [ 68.261315] ata1.00: supports DRM functions and may not be fully accessible
Feb 10 23:37:52 amd-BILBY kernel: [ 68.268558] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [ 68.276173] ata1.00: disabling queued TRIM support
Feb 10 23:37:52 amd-BILBY kernel: [ 68.283010] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
Feb 10 23:37:52 amd-BILBY kernel: [ 68.289443] ata1.00: supports DRM functions and may not be fully accessible
Feb 10 23:37:52 amd-BILBY kernel: [ 68.302782] ata1.00: disabling queued TRIM support
Feb 10 23:37:52 amd-BILBY kernel: [ 68.308863] ata1.00: configured for UDMA/133
Feb 10 23:37:52 amd-BILBY kernel: [ 68.597833] pci 0000:03:00.0: Removing from iommu group 21
Feb 10 23:37:52 amd-BILBY kernel: [ 68.597991] acpi LNXPOWER:08: Turning OFF
Feb 10 23:37:52 amd-BILBY kernel: [ 68.605244] pci_bus 0000:03: busn_res: [bus 03] is released
Feb 10 23:37:52 amd-BILBY kernel: [ 68.611552] acpi LNXPOWER:07: Turning OFF
Feb 10 23:37:52 amd-BILBY kernel: [ 68.619469] pci 0000:02:00.0: Removing from iommu group 20
Feb 10 23:37:52 amd-BILBY kernel: [ 68.626121] acpi LNXPOWER:04: Turning OFF
Feb 10 23:37:52 amd-BILBY kernel: [ 68.632720] pci_bus 0000:02: busn_res: [bus 02-03] is released
Feb 10 23:37:52 amd-BILBY kernel: [ 68.638105] OOM killer enabled.
Feb 10 23:37:52 amd-BILBY kernel: [ 68.645106] pci 0000:01:00.0: Removing from iommu group 19
Feb 10 23:37:52 amd-BILBY kernel: [ 68.649418] Restarting tasks ... done.
Feb 10 23:37:52 amd-BILBY kernel: [ 68.662516] PM: suspend exit
Feb 10 23:37:52 amd-BILBY kernel: [ 68.669613] rfkill: input handler disabled
Feb 10 23:37:52 amd-BILBY kernel: [ 68.695045] show_signal_msg: 28 callbacks suppressed
Feb 10 23:37:52 amd-BILBY kernel: [ 68.695048] glmark2[1894]: segfault at 0 ip 00007f799dae6d85 sp 00007ffd34320bc0 error 4 in radeonsi_dri.so[7f799d28d000+a94000]
Feb 10 23:37:52 amd-BILBY kernel: [ 68.711653] Code: 00 4c 39 ed 74 6f 49 89 fc eb 1f 66 2e 0f 1f 84 00 00 00 00 00 48 89 ef e8 08 a2 7a ff 49 8b ac 24 e0 77 00 00 4c 39 ed 74 4b <48> 8b 55 00 48 8b 45 08 48 8b 5d 10 48 89 42 08 48 89 10 48 c7 45
Feb 10 23:37:53 amd-BILBY kernel: [ 69.684921] pcieport 0000:00:01.1: AER: Root Port link has been reset
Feb 10 23:37:53 amd-BILBY kernel: [ 69.691438] pcieport 0000:00:01.1: AER: Device recovery failed
Feb 10 23:37:53 amd-BILBY kernel: [ 69.697327] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
Feb 10 23:37:53 amd-BILBY kernel: [ 69.706231] pcieport 0000:00:01.1: AER: can't find device of ID0008
Feb 10 23:40:33 amd-BILBY kernel: [ 228.769973] sysrq: HELP : loglevel(0-9) reboot(b) crash(c) terminate-all-tasks(e) memory-full-oom-kill(f) kill-all-tasks(i) thaw-filesystems(j) sak(k) show-backtrace-all-active-cpus(l) show-memory-usage(m) nice-all-RT-tasks(n) poweroff(o) show-registers(p) show-all-timers(q) unraw(r) sync(s) show-task-states(t) unmount(u) force-fb(V) show-blocked-tasks(w) dump-ftrace-buffer(z)
Feb 10 23:41:47 amd-BILBY kernel: [ 302.759503] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:41:49 amd-BILBY kernel: [ 304.795473] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:41:49 amd-BILBY kernel: [ 304.803146] pcieport 0000:00:01.1: pciehp: Failed to check link status
Feb 10 23:42:30 amd-BILBY kernel: [ 345.767046] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:42:32 amd-BILBY kernel: [ 347.811119] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:42:32 amd-BILBY kernel: [ 347.818793] pcieport 0000:00:01.1: pciehp: Failed to check link status
Feb 10 23:45:13 amd-BILBY kernel: [ 508.465497] pcieport 0000:00:01.1: pciehp: Slot(0): Card present
Feb 10 23:45:15 amd-BILBY kernel: [ 510.505681] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec
Feb 10 23:45:15 amd-BILBY kernel: [ 510.513355] pcieport 0000:00:01.1: pciehp: Failed to check link status
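
To make the repeated hotplug attempts in the log easier to compare, here is a minimal Python sketch that pairs each "Card present" message with the next "Failed to check link status" message and reports the elapsed kernel time. The function name presence_to_failure and the regular expression are illustrative choices only, not part of any existing tool, and the sample lines are copied from the log above.

```python
import re

# Matches the "[ seconds.micros] message" part of a kernel log line.
ENTRY = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")


def presence_to_failure(lines):
    """Yield (present_ts, failure_ts, delta) for each failed hotplug attempt."""
    start = None
    for line in lines:
        m = ENTRY.search(line)
        if not m:
            continue  # not a kernel log entry with a timestamp
        ts, msg = float(m.group(1)), m.group(2)
        if "pciehp:" in msg and "Card present" in msg:
            start = ts
        elif "pciehp: Failed to check link status" in msg and start is not None:
            yield start, ts, round(ts - start, 6)
            start = None


sample = [
    "Feb 10 23:41:47 amd-BILBY kernel: [ 302.759503] pcieport 0000:00:01.1: pciehp: Slot(0): Card present",
    "Feb 10 23:41:49 amd-BILBY kernel: [ 304.795473] pcieport 0000:00:01.1: Data Link Layer Link Active not set in 1000 msec",
    "Feb 10 23:41:49 amd-BILBY kernel: [ 304.803146] pcieport 0000:00:01.1: pciehp: Failed to check link status",
]

attempts = list(presence_to_failure(sample))
for present, failed, delta in attempts:
    print(f"present at {present:.6f}, failed at {failed:.6f}, delta {delta:.6f} s")
```

For the first attempt in the log this gives roughly 2.04 s between the presence event and the link check giving up.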

Andrey

On 2022-02-10 01:23, Lukas Wunner wrote:
On Wed, Feb 09, 2022 at 02:54:06PM -0500, Andrey Grodzovsky wrote:
Hi, on a kernel based on 5.4.2 we are observing a deadlock between
the reset_lock semaphore and device_lock (dev->mutex). Our scenario
is: put the system to sleep, disconnect the eGPU from the PCIe bus
(through a special SBIOS setting, or by simply removing power from
the external PCIe cage), and then wake the system up.
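
As a rough user-space analogy only (not kernel code), and assuming the deadlock is the classic case of two paths taking the same pair of locks in opposite order, the situation can be sketched in Python. The two threading.Lock objects merely borrow the names reset_lock and device_lock; the path names, the timeout, and the helper functions are all hypothetical and chosen so that the demonstration terminates instead of hanging.

```python
import threading

# Stand-ins for the two kernel locks named in the report (illustrative only).
reset_lock = threading.Lock()
device_lock = threading.Lock()


def inverted(first, second, barrier, results, name):
    """Hold `first`, then try `second` with a timeout instead of blocking forever."""
    with first:
        barrier.wait()                     # both paths now hold their first lock
        got = second.acquire(timeout=0.2)  # a plain blocking acquire would hang here
        barrier.wait()                     # keep `first` held until both attempts end
        if got:
            second.release()
        results[name] = got


def ordered(results, name):
    """Both paths agree on one order: reset_lock first, then device_lock."""
    with reset_lock:
        with device_lock:
            results[name] = True


def run(target, args_a, args_b):
    """Run the same function on two threads and wait for both to finish."""
    threads = [threading.Thread(target=target, args=args_a),
               threading.Thread(target=target, args=args_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


stuck = {}
barrier = threading.Barrier(2)
run(inverted,
    (reset_lock, device_lock, barrier, stuck, "path_a"),
    (device_lock, reset_lock, barrier, stuck, "path_b"))

fine = {}
run(ordered, (fine, "path_a"), (fine, "path_b"))

print("opposite order:", stuck)
print("same order:", fine)
```

With opposite ordering neither path can obtain its second lock; with a single agreed order both complete.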

I attached the log. Please advise if you have any idea how
to work around it. Since the kernel is old, does anyone
know whether this issue is known and already solved in later kernels?
We cannot try the latest kernel since ours is custom for that platform.

It is a known issue.  Here's a fix I submitted during the v5.9 cycle:

https://lore.kernel.org/linux-pci/908047f7699d9de9ec2efd6b79aa752d73dab4b6.1595329748.git.lukas@wunner.de/

The fix hasn't been applied yet.  I think I need to rework the patch,
just haven't found the time.

Since the trigger in your case are AER-handled errors during a
system sleep transition, you may also want to consider the
following 2-patch series by Kai-Heng Feng which is currently
under discussion:

https://lore.kernel.org/linux-pci/20220127025418.1989642-1-kai.heng.feng@canonical.com/

That series disables AER during a system sleep transition and
should thus prevent the flood of AER-handled errors you're seeing.
Once AER is disabled, the reset-induced deadlocks should go away as well.

Thanks,

Lukas


