I got a bounce reply ('Message too long (>100000 chars)'), so I am
resending with only the essential part of the log inline here:
[ 56.138636] ACPI: Waking up from system sleep state S3
[ 56.140541] pcieport 0000:01:00.0: Refused to change power state, currently in D3
[ 56.143542] pcieport 0000:02:00.0: Refused to change power state, currently in D3
[ 56.146517] amdgpu 0000:03:00.0: Refused to change power state, currently in D3
[ 56.209416] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[ 56.209438] pcieport 0000:00:01.1: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ 56.209440] pcieport 0000:00:01.1: AER: device [1022:15d3] error status/mask=00004000/04400000
[ 56.209441] pcieport 0000:00:01.1: AER: [14] CmpltTO (First)
[ 56.209817] sd 0:0:0:0: [sda] Starting disk
[ 56.211483] [drm] PCIE GART of 1024M enabled.
[ 56.211484] [drm] PTB located at 0x000000F400E10000
[ 56.211508] [drm] PSP is resuming...
[ 56.231386] [drm] reserve 0x400000 from 0xf41fc00000 for PSP TMR
[ 56.312520] amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 56.320623] amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 56.326446] [drm] kiq ring mec 2 pipe 1 q 0
[ 56.326919] amdgpu: restore the fine grain parameters
[ 56.539633] [drm] VCN decode and encode initialized successfully(under SPG Mode).
[ 56.539655] amdgpu 0000:05:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[ 56.539656] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 56.539657] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 56.539658] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 56.539660] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 56.539661] amdgpu 0000:05:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 56.539662] amdgpu 0000:05:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 56.539663] amdgpu 0000:05:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 56.539664] amdgpu 0000:05:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 56.539665] amdgpu 0000:05:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 56.539666] amdgpu 0000:05:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[ 56.539667] amdgpu 0000:05:00.0: amdgpu: ring vcn_dec uses VM inv eng 1 on hub 1
[ 56.539668] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 4 on hub 1
[ 56.539669] amdgpu 0000:05:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 5 on hub 1
[ 56.539670] amdgpu 0000:05:00.0: amdgpu: ring jpeg_dec uses VM inv eng 6 on hub 1
[ 56.685926] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 56.686175] ata1.00: supports DRM functions and may not be fully accessible
[ 56.686848] ata1.00: disabling queued TRIM support
[ 56.688408] ata1.00: supports DRM functions and may not be fully accessible
[ 56.688925] ata1.00: disabling queued TRIM support
[ 56.690217] ata1.00: configured for UDMA/133
[ 57.246588] pcieport 0000:00:01.1: AER: Root Port link has been reset
[ 57.246635] pcieport 0000:00:01.1: AER: Device recovery failed
[ 57.246668] pcieport 0000:00:01.1: pciehp: Slot(0): Card not present
[ 57.247019] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[ 57.247198] pcieport 0000:00:01.1: AER: Multiple Uncorrected (Fatal) error received: 0000:00:01.0
[ 57.247212] pcieport 0000:00:01.1: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Requester ID)
[ 57.247214] pcieport 0000:00:01.1: AER: device [1022:15d3] error status/mask=00004000/04400000
[ 57.247217] pcieport 0000:00:01.1: AER: [14] CmpltTO (First)
[ 59.038917] pci 0000:03:00.0: Removing from iommu group 21
[ 59.039314] pci_bus 0000:03: busn_res: [bus 03] is released
[ 59.039790] acpi LNXPOWER:08: Turning OFF
[ 59.040014] acpi LNXPOWER:07: Turning OFF
[ 59.040296] acpi LNXPOWER:04: Turning OFF
[ 59.040500] acpi LNXPOWER:03: Turning OFF
[ 59.040682] OOM killer enabled.
[ 59.040682] Restarting tasks ...
[ 59.041112] systemd-journald[342]: /dev/kmsg buffer overrun, some messages lost.
[ 59.047174] done.
[ 59.047182] PM: suspend exit
[ 61.382560] show_signal_msg: 29 callbacks suppressed
[ 61.382563] glmark2[1891]: segfault at 0 ip 00007fdebc1cbd85 sp 00007ffd56800870 error 4 in radeonsi_dri.so[7fdebb972000+a94000]
[ 61.382574] Code: 00 4c 39 ed 74 6f 49 89 fc eb 1f 66 2e 0f 1f 84 00 00 00 00 00 48 89 ef e8 08 a2 7a ff 49 8b ac 24 e0 77 00 00 4c 39 ed 74 4b <48> 8b 55 00 48 8b 45 08 48 8b 5d 10 48 89 42 08 48 89 10 48 c7 45
[ 243.354138] INFO: task irq/26-aerdrv:170 blocked for more than 120 seconds.
[ 243.354145] Not tainted 5.4.2-10-feb+ #51
[ 243.354147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.354150] irq/26-aerdrv D 0 170 2 0x80004000
[ 243.354156] Call Trace:
[ 243.354170] ? __schedule+0x2e3/0x740
[ 243.354173] schedule+0x39/0xa0
[ 243.354179] rwsem_down_write_slowpath+0x244/0x4d0
[ 243.354183] ? schedule+0x39/0xa0
[ 243.354186] ? schedule_preempt_disabled+0xa/0x10
[ 243.354192] pciehp_reset_slot+0x51/0x150
[ 243.354198] pci_reset_hotplug_slot+0x3c/0x60
[ 243.354202] pci_slot_reset+0x107/0x130
[ 243.354205] pci_bus_error_reset+0xf3/0x120
[ 243.354210] aer_root_reset+0x5c/0xf0
[ 243.354214] pcie_do_recovery+0x13e/0x275
[ 243.354217] aer_process_err_devices+0xb2/0xc7
[ 243.354220] aer_isr.cold+0x50/0x9f
[ 243.354223] ? __schedule+0x2eb/0x740
[ 243.354228] ? irq_finalize_oneshot.part.0+0xf0/0xf0
[ 243.354230] irq_thread_fn+0x20/0x60
[ 243.354234] irq_thread+0xdc/0x170
[ 243.354237] ? irq_forced_thread_fn+0x80/0x80
[ 243.354241] kthread+0xf9/0x130
[ 243.354245] ? irq_thread_check_affinity+0xf0/0xf0
[ 243.354247] ? kthread_park+0x90/0x90
[ 243.354250] ret_from_fork+0x22/0x40
[ 243.354255] INFO: task irq/26-pciehp:171 blocked for more than 120 seconds.
[ 243.354257] Not tainted 5.4.2-10-feb+ #51
[ 243.354259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 243.354261] irq/26-pciehp D 0 171 2 0x80004000
[ 243.354263] Call Trace:
[ 243.354266] ? __schedule+0x2e3/0x740
[ 243.354269] schedule+0x39/0xa0
[ 243.354271] schedule_preempt_disabled+0xa/0x10
[ 243.354274] __mutex_lock.isra.0+0x182/0x4f0
[ 243.354279] ? irq_finalize_oneshot.part.0+0xf0/0xf0
[ 243.354284] device_del+0x35/0x370
[ 243.354288] pci_remove_bus_device+0x77/0x100
[ 243.354292] pci_remove_bus_device+0x2e/0x100
[ 243.354296] pciehp_unconfigure_device+0x7c/0x12f
[ 243.354299] pciehp_disable_slot+0x6b/0x100
[ 243.354303] pciehp_handle_presence_or_link_change+0xdc/0x140
[ 243.354306] pciehp_ist+0x10f/0x120
[ 243.354309] irq_thread_fn+0x20/0x60
[ 243.354312] irq_thread+0xdc/0x170
[ 243.354316] ? irq_forced_thread_fn+0x80/0x80
[ 243.354318] kthread+0xf9/0x130
[ 243.354321] ? irq_thread_check_affinity+0xf0/0xf0
[ 243.354323] ? kthread_park+0x90/0x90
[ 243.354326] ret_from_fork+0x22/0x40
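
For anyone who wants the lock ordering in isolation, here is a minimal
userspace sketch of what the two hung-task traces appear to show. It is
only an illustration, not kernel code: it assumes the AER recovery thread
already holds the device lock when pciehp_reset_slot() tries to take
reset_lock, and that the pciehp thread already holds reset_lock when
device_del() tries to take the device lock. The names simulate_abba,
worker and the plain locks are hypothetical stand-ins (in the kernel,
reset_lock is an rw_semaphore and device_lock is dev->mutex), and a
timeout replaces the indefinite block so the script terminates.

```python
import threading


def simulate_abba(timeout=0.5):
    """Run two threads that take two locks in opposite order.

    Returns a dict saying whether each thread obtained its second lock.
    """
    # Hypothetical stand-ins for ctrl->reset_lock and dev->mutex.
    reset_lock = threading.Lock()
    device_lock = threading.Lock()
    sync = threading.Barrier(2)
    results = {}

    def worker(name, first, second):
        with first:
            # Wait until both threads hold their first lock.
            sync.wait()
            # In the kernel this would block forever; here it times out.
            got = second.acquire(timeout=timeout)
            results[name] = got
            # Keep holding the first lock until both attempts are over,
            # so neither thread can win by outlasting the other.
            sync.wait()
            if got:
                second.release()

    threads = [
        # Assumed AER path: device lock first, then reset_lock.
        threading.Thread(target=worker, args=("aerdrv", device_lock, reset_lock)),
        # Assumed pciehp path: reset_lock first, then device lock.
        threading.Thread(target=worker, args=("pciehp", reset_lock, device_lock)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    for name, got in sorted(simulate_abba().items()):
        print(name, "acquired second lock:", got)
```

Both threads report that they could not take their second lock, which
is the userspace equivalent of the two tasks stuck in D state above.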
Andrey
On 2022-02-09 14:54, Andrey Grodzovsky wrote:
Hi, on a kernel based on 5.4.2 we are observing a deadlock between the
reset_lock semaphore and device_lock (dev->mutex). The scenario is:
put the system to sleep, disconnect the eGPU from the PCIe bus (either
through a special SBIOS setting or by simply removing power to the
external PCIe cage), then wake the system up.
I attached the log. Please advise if you have any idea how to work
around it. Since the kernel is old, does anyone know whether this issue
is known and already fixed in later kernels? We cannot try the latest
kernel since ours is custom for that platform.
Thanks,
Andrey