Re: [Bug 215958] New: thunderbolt3 egpu cannot disconnect cleanly

On 2022-05-09 14:03, Deucher, Alexander wrote:
[Public]

-----Original Message-----
From: Bjorn Helgaas <bjorn.helgaas@xxxxxxxxx>
Sent: Monday, May 9, 2022 12:23 PM
To: Linux PCI <linux-pci@xxxxxxxxxxxxxxx>
Cc: r087r70@xxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>; amd-gfx mailing list <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>; dri-devel <dri-devel@xxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [Bug 215958] New: thunderbolt3 egpu cannot disconnect cleanly

On Sun, May 8, 2022 at 3:29 PM <bugzilla-daemon@xxxxxxxxxx> wrote:
https://bugzilla.kernel.org/show_bug.cgi?id=215958

             Bug ID: 215958
            Summary: thunderbolt3 egpu cannot disconnect cleanly
            Product: Drivers
            Version: 2.5
     Kernel Version: 5.17.0-1003-oem #3-Ubuntu SMP PREEMPT
           Hardware: All
                 OS: Linux
               Tree: Mainline
             Status: NEW
           Severity: normal
           Priority: P1
          Component: PCI
           Assignee: drivers_pci@xxxxxxxxxxxxxxxxxxxx
           Reporter: r087r70@xxxxxxxx
         Regression: No
I assume this is not a regression, right?  If it is a regression, what previous
kernel worked correctly?

I have an eGPU (Radeon 6600 RX) connected through
thunderbolt3 to my ThinkPad X1 Carbon 6th Gen. When I disconnect the
thunderbolt3 cable I get the following error in dmesg:

[21874.194994] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195006] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.195123] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195129] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.195271] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195276] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.195406] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195411] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.195544] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:51 param:0x00000000 message:GetPptLimit?
[21874.195550] amdgpu 0000:0c:00.0: amdgpu: [smu_v11_0_get_current_power_limit] get PPT limit failed!
[21874.195582] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.195587] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.227454] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.227463] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.227532] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.227536] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.227618] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.227621] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.227700] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:18 param:0x00000005 message:TransferTableSmu2Dram?
[21874.227703] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.227784] amdgpu 0000:0c:00.0: amdgpu: [smu_v11_0_get_current_power_limit] get PPT limit failed!
[21874.227804] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU metrics table!
[21874.514661] snd_hda_codec_hdmi hdaudioC1D0: Unable to sync register 0x2f0d00. -5
[21874.568360] amdgpu 0000:0c:00.0: amdgpu: Failed to switch to AC mode!
[21874.599292] amdgpu 0000:0c:00.0: amdgpu: Failed to switch to AC mode!
[21874.718562] amdgpu 0000:0c:00.0: amdgpu: amdgpu: finishing device.
[21878.722376] amdgpu: cp queue pipe 4 queue 0 preemption failed
[21878.722422] amdgpu 0000:0c:00.0: amdgpu: Failed to disable gfxoff!
[21879.134918] amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[21879.135144] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[21879.338158] amdgpu 0000:0c:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[21879.338402] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[21879.543318] [drm:gfx_v10_0_cp_gfx_enable.isra.0 [amdgpu]] *ERROR* failed to halt cp gfx
[21879.544216] __smu_cmn_reg_print_error: 5 callbacks suppressed
[21879.544220] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF for index:7 param:0x00000000 message:DisableAllSmuFeatures?
[21879.544226] amdgpu 0000:0c:00.0: amdgpu: Failed to disable smu features.
[21879.544230] amdgpu 0000:0c:00.0: amdgpu: Fail to disable dpm features!
[21879.544238] [drm] free PSP TMR buffer
The above looks like what amdgpu would see when the GPU is no longer
accessible (writes are dropped and reads return 0xffffffff).  It's possible
amdgpu could notice this and shut down more gracefully, but I don't think it's
the main problem here and it probably wouldn't force you to reboot.
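
As a rough illustration of the "reads return 0xffffffff" signature, here is a minimal Python sketch that counts SMU error lines whose response is all ones, per device. The regular expression is derived only from the log lines quoted in this thread, not from the driver source, and it assumes one message per line (i.e. an unwrapped dmesg capture). The function name is made up for this example.

```python
import re

# Pattern inferred from the quoted log lines only (an assumption, not an
# official log format): "amdgpu <BDF>: amdgpu: SMU: response:0x... for index:N"
SMU_RE = re.compile(
    r"amdgpu (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"amdgpu: SMU: response:(?P<resp>0x[0-9A-Fa-f]+) for index:(?P<index>\d+)"
)

ALL_ONES = 0xFFFFFFFF  # what a 32-bit read returns once the device is gone


def devices_reading_all_ones(dmesg_text):
    """Return {BDF: count} of SMU responses that came back as all ones."""
    counts = {}
    for line in dmesg_text.splitlines():
        m = SMU_RE.search(line)
        if m and int(m.group("resp"), 16) == ALL_ONES:
            bdf = m.group("bdf")
            counts[bdf] = counts.get(bdf, 0) + 1
    return counts


if __name__ == "__main__":
    sample = (
        "[21874.194994] amdgpu 0000:0c:00.0: amdgpu: SMU: response:0xFFFFFFFF "
        "for index:18 param:0x00000005 message:TransferTableSmu2Dram?\n"
        "[21874.195006] amdgpu 0000:0c:00.0: amdgpu: Failed to export SMU "
        "metrics table!\n"
    )
    print(devices_reading_all_ones(sample))
```

A burst of such lines for a single device right after an unplug is consistent with the device simply being unreachable, rather than with a firmware fault.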
+ Andrey who has been working on properly handling PCI hotplug on AMD GPUs.


Added comment in the ticket.

Andrey



[21880.455935] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=io+mem
[21880.456218] pci 0000:0c:00.0: Removing from iommu group 14
[21880.456715] pci 0000:0c:00.1: Removing from iommu group 14
[21880.456798] pci_bus 0000:0c: busn_res: [bus 0c] is released
[21880.456950] pci 0000:0b:00.0: Removing from iommu group 14
[21880.456985] pci_bus 0000:0b: busn_res: [bus 0b-0c] is released
[21880.457106] pci 0000:0a:00.0: Removing from iommu group 14
[21880.457156] pci_bus 0000:0a: busn_res: [bus 0a-0c] is released
[21880.457279] pci 0000:09:01.0: Removing from iommu group 14
[21880.457311] pci_bus 0000:09: busn_res: [bus 09-3a] is released
[21880.457543] pci 0000:08:00.0: Removing from iommu group 14
This looks like removing 0c:00.0 (the GPU) and two switches leading to it
(probably part of the Thunderbolt topology), so to be expected.

The GPU actually consists of multiple PCI devices, depending on the generation.  Back when HDMI audio became a thing, an audio endpoint was added.  Then more recently we added upstream and downstream PCI ports which connect the GPU devices to the system.  On the GPU side of the ports are the GPU, audio, and often USB and I2C (for UCSI).

[21880.457847] pci_bus 0000:06: Allocating resources
[21880.457888] pcieport 0000:06:02.0: bridge window [io  0x1000-0x0fff] to [bus 3b] add_size 1000
[21880.457897] pcieport 0000:06:04.0: bridge window [io  0x1000-0x0fff] to [bus 3c-6f] add_size 1000
[21880.457913] pcieport 0000:06:02.0: BAR 13: no space for [io  size 0x1000]
[21880.457919] pcieport 0000:06:02.0: BAR 13: failed to assign [io  size 0x1000]
[21880.457924] pcieport 0000:06:04.0: BAR 13: no space for [io  size 0x1000]
[21880.457928] pcieport 0000:06:04.0: BAR 13: failed to assign [io  size 0x1000]
[21880.457934] pcieport 0000:06:04.0: BAR 13: no space for [io  size 0x1000]
[21880.457938] pcieport 0000:06:04.0: BAR 13: failed to assign [io  size 0x1000]
[21880.457943] pcieport 0000:06:02.0: BAR 13: no space for [io  size 0x1000]
[21880.457947] pcieport 0000:06:02.0: BAR 13: failed to assign [io  size 0x1000]
I'm not sure why we're allocating resources as part of the removal.
The hierarchies under 06:02.0 (to [bus 3b]) and 06:04.0 (to [bus
3c-6f]) seem to be siblings of the hierarchy you just removed (my guess is that
was 06:01.0 to [bus 08-3a]).  But again, shouldn't require a reboot.
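
For readers puzzling over the "bridge window [io  0x1000-0x0fff]" lines: the end address is one below the start, so the window is currently empty (size 0), and the kernel is trying to grow it by add_size. A small Python sketch that parses such a line and makes the arithmetic explicit follows. It assumes add_size is printed in hex, which is consistent with the "size 0x1000" messages that follow; the regex and function name are illustrative only.

```python
import re

# Inferred from the quoted log lines; treat the format as an assumption.
WINDOW_RE = re.compile(
    r"bridge window \[(?P<kind>io|mem)\s+0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)"
    r"[^\]]*\] to \[bus (?P<bus>[0-9a-f]+(?:-[0-9a-f]+)?)\] "
    r"add_size (?P<add>[0-9a-f]+)"
)


def parse_window(line):
    """Return window kind, bus range, current size and requested add_size."""
    m = WINDOW_RE.search(line)
    if m is None:
        return None
    start = int(m.group("start"), 16)
    end = int(m.group("end"), 16)
    return {
        "kind": m.group("kind"),
        "bus": m.group("bus"),
        # Inclusive range: end == start - 1 means an empty (size 0) window.
        "size": end - start + 1,
        "add_size": int(m.group("add"), 16),  # assumed hex, see lead-in
    }


if __name__ == "__main__":
    line = ("[21880.457888] pcieport 0000:06:02.0: bridge window "
            "[io  0x1000-0x0fff] to [bus 3b] add_size 1000")
    print(parse_window(line))
```

So both sibling ports start with no I/O window and each asks for 0x1000 bytes of I/O space, which the "no space" and "failed to assign" lines then report as unavailable.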

upon reconnection of the cable I get:

[22192.753261] input: HDA ATI HDMI HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input98
[22192.753738] input: HDA ATI HDMI HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input99
[22192.753952] input: HDA ATI HDMI HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input100
[22192.755234] input: HDA ATI HDMI HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input101
[22192.763885] input: HDA ATI HDMI HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:1d.0/0000:05:00.0/0000:06:01.0/0000:08:00.0/0000:09:01.0/0000:0a:00.0/0000:0b:00.0/0000:0c:00.1/sound/card1/input102
[22192.975773] thunderbolt 0-1: new device found, vendor=0x127 device=0x1
[22192.975786] thunderbolt 0-1: Razer Core X

but the egpu no longer appears in `xrandr --listproviders`. Full
reboot is needed.
Can you please build with CONFIG_DYNAMIC_DEBUG=y, boot with
'dyndbg="file pciehp* +p"', and attach the complete dmesg log to the
bugzilla?  Also please attach the complete "sudo lspci -vv" output (before the
unplug and after the replug)?
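
Once both lspci captures exist, a quick way to see which functions did not come back after the replug is to compare the device addresses in the two files. The Python sketch below does that. It assumes the usual lspci layout where each device block starts with an unindented bus:device.function address; the function names are made up, and the sample text in the demo is a placeholder, not real output from the reporter's machine.

```python
import re

# Device header lines start at column 0 with an optional domain and a BDF.
ADDR_RE = re.compile(
    r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", re.MULTILINE
)


def device_addresses(lspci_text):
    """Collect the addresses of all devices listed in an lspci capture."""
    return set(ADDR_RE.findall(lspci_text))


def compare_captures(before, after):
    """Report devices present only before, or only after, the replug."""
    b, a = device_addresses(before), device_addresses(after)
    return {"missing": sorted(b - a), "new": sorted(a - b)}


if __name__ == "__main__":
    # Placeholder captures for illustration only; in practice read the files
    # saved from "sudo lspci -vv" before the unplug and after the replug.
    before = (
        "00:02.0 (placeholder description)\n"
        "\tCapabilities: (placeholder)\n"
        "0c:00.0 (placeholder description)\n"
        "0c:00.1 (placeholder description)\n"
    )
    after = "00:02.0 (placeholder description)\n"
    print(compare_captures(before, after))
```

The full -vv output is still what should be attached to the bugzilla; this only helps narrow down where to look.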


