[Bug 216224] New: AMDGPU fails to reset RX 480 after Ring GFX timeout

bugzilla-daemon@xxxxxxxxxx · Sat, 09 Jul 2022 06:26:53 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=216224

            Bug ID: 216224
           Summary: AMDGPU fails to reset RX 480 after Ring GFX timeout
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.18.0 and earlier versions
          Hardware: Intel
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: high
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@xxxxxxxxxxxxxxxxxxxx
          Reporter: happysmash27@xxxxxxxxxxxxxx
        Regression: No

Created attachment 301374
  --> https://bugzilla.kernel.org/attachment.cgi?id=301374&action=edit
SteamTV-induced crash 2022-07-08

This is perhaps the worst bug I have ever experienced and has been going on for
a couple years now. A ring GFX timeout can be triggered fairly reliably by
launching SteamVR while using literally anything else that uses the GPU (even
having Waterfox open usually causes a crash) or by running an Ethereum miner
(every one I have tried does this, some worse than others) and having something
else using the GPU such as Blender (moving around in a 3D scene should do it).
Other things do this as well (such as the time this happened when running Sway
and LXDE at the same time) but Ethereum miners and SteamVR seem to be the
programs that trigger it the most reliably. I imagine trying to run both at
once would be a guaranteed GPU meltdown. 

When the bug occurs, something like: 

[1767311.339261] [drm:amdgpu_job_timedout] *ERROR* ring comp_1.0.0 timeout,
signaled seq=23354655, emitted seq=23354657
[1767311.339274] [drm:amdgpu_job_timedout] *ERROR* Process information: process
vrcompositor pid 7701 thread vrcompositor pid 7701
[1767311.339279] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!

or: 

57492.984178] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled
seq=9450226, emitted seq=9450228
[57492.984189] [drm:amdgpu_job_timedout] *ERROR* Process information: process
vrcompositor pid 8198 thread vrcompositor pid 8198
[57492.984194] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!

or: 

[112094.633679] [drm:dm_vblank_get_counter] *ERROR* dc_stream_state is NULL for
crtc '2'!
[112094.633696] [drm:dm_crtc_get_scanoutpos] *ERROR* dc_stream_state is NULL
for crtc '2'!
[112094.633699] [drm:dm_vblank_get_counter] *ERROR* dc_stream_state is NULL for
crtc '2'!
[112094.633703] ------------[ cut here ]------------
[112094.633704] amdgpu 0000:03:00.0:
drm_WARN_ON_ONCE(drm_drv_uses_atomic_modeset(dev))
[112094.633735] WARNING: CPU: 9 PID: 12640 at drivers/gpu/drm/drm_vblank.c:728
drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x331/0x340
[112094.633741] Modules linked in: uvcvideo videobuf2_vmalloc videobuf2_memops
videobuf2_v4l2 videobuf2_common videodev mc bpfilter btusb btrtl btbcm btintel
mptsas mptscsih mptbase
[112094.633754] CPU: 9 PID: 12640 Comm: VulkanVblankThr Not tainted
5.18.0-gentoo #1
[112094.633757] Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.1     03/17/2012
[112094.633758] RIP:
0010:drm_crtc_vblank_helper_get_vblank_timestamp_internal+0x331/0x340
[112094.633762] Code: 4c 8b 6f 50 4d 85 ed 75 03 4c 8b 2f e8 28 81 32 00 48 c7
c1 f0 2a 2d 83 4c 89 ea 48 c7 c7 f2 b0 2c 83 48 89 c6 e8 67 b6 b8 00 <0f> 0b e9
d0 fe ff ff e8 23 ec c3 00 0f 1f 00 48 8b 87 b0 01 00 00
[112094.633764] RSP: 0018:ffffc9000b1e7bb0 EFLAGS: 00010082
[112094.633766] RAX: 0000000000000000 RBX: ffffffff81aeae50 RCX:
0000000000000000
[112094.633767] RDX: 0000000000000003 RSI: ffffffff8330cd18 RDI:
00000000ffffffff
[112094.633769] RBP: ffffc9000b1e7c20 R08: ffffffff837653c8 R09:
00000000ffffdfff
[112094.633770] R10: ffffffff836853e0 R11: ffffffff8373edb8 R12:
0000000000000000
[112094.633771] R13: ffff888103ee7380 R14: 0000000000000003 R15:
ffff888e448e6b08
[112094.633773] FS:  00007fcf587f8640(0000) GS:ffff889dffac0000(0000)
knlGS:0000000000000000
[112094.633775] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[112094.633776] CR2: 00007fceec001258 CR3: 0000000363f60003 CR4:
00000000000206e0
[112094.633778] Call Trace:
[112094.633780]  <TASK>
[112094.633783]  drm_get_last_vbltimestamp+0xa5/0xb0
[112094.633789]  drm_update_vblank_count+0x7f/0x3c0
[112094.633793]  drm_vblank_enable+0x13e/0x170
[112094.633796]  drm_vblank_get+0x8b/0xd0
[112094.633798]  drm_crtc_queue_sequence_ioctl+0xee/0x2a0
[112094.633801]  ? drm_crtc_get_sequence_ioctl+0x190/0x190
[112094.633803]  drm_ioctl_kernel+0xac/0x140
[112094.633809]  drm_ioctl+0x1f5/0x3c0
[112094.633812]  ? drm_crtc_get_sequence_ioctl+0x190/0x190
[112094.633814]  ? ioctl_has_perm.constprop.0.isra.0+0xb4/0x110
[112094.633821]  amdgpu_drm_ioctl+0x44/0x80
[112094.633825]  __x64_sys_ioctl+0x7d/0xb0
[112094.633830]  do_syscall_64+0x3b/0x90
[112094.633833]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[112094.633838] RIP: 0033:0x7fcf8d9f2f0b
[112094.633840] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00
00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0
3d 00 f0 ff ff 77 1b 48 8b 44 24 18 64 48 2b 04 25 28 00
[112094.633842] RSP: 002b:00007fcf587f7aa0 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[112094.633844] RAX: ffffffffffffffda RBX: 00007fcf587f7b30 RCX:
00007fcf8d9f2f0b
[112094.633845] RDX: 00007fcf587f7b30 RSI: 00000000c018643c RDI:
000000000000004c
[112094.633847] RBP: 00000000c018643c R08: 0000000000000000 R09:
00007fceec000be0
[112094.633848] R10: 00007fcf89fc5ba0 R11: 0000000000000246 R12:
000055e7852de508
[112094.633849] R13: 000000000000004c R14: 000055e785460fd0 R15:
000055e7852de4c0
[112094.633852]  </TASK>
[112094.633852] ---[ end trace 0000000000000000 ]---
[112094.633854] [drm:dm_vblank_get_counter] *ERROR* dc_stream_state is NULL for
crtc '2'!
[112094.633857] [drm:dm_crtc_get_scanoutpos] *ERROR* dc_stream_state is NULL
for crtc '2'!
[112094.633859] [drm:dm_vblank_get_counter] *ERROR* dc_stream_state is NULL for
crtc '2'!
[112114.167264] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled
seq=3094506, emitted seq=3094508
[112114.167276] [drm:amdgpu_job_timedout] *ERROR* Process information: process
vrcompositor pid 12498 thread vrcompositor pid 12498
[112114.167281] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[112148.281473] elogind-daemon[4528]: New session 8 of user happysmash27.
[112174.774130] elogind-daemon[4528]: New session 9 of user happysmash27.
[112192.196386] Discord[5027]: segfault at 0 ip 00007ff10741e0d6 sp
00007ff1073fdd00 error 4 in discord_utils.node[7ff107400000+8d000]
[112192.196413] Code: 38 48 83 ff 01 77 09 49 8d 70 ff 48 21 ce eb 13 48 89 ce
4c 39 c1 72 0b 48 89 c8 31 d2 49 f7 f0 48 89 d6 48 8b 05 5a 47 27 00 <48> 8b 04
f0 48 85 c0 74 76 48 8b 18 48 85 db 74 6e 4d 8d 48 ff eb

will occur, all on kernel 5.18.0, but various similar issues also occured on
earlier kernel versions. These more recent ones are all triggered by SteamVR.
Ethereum miners (if I am reading this log properly) on older versions (this one
is 5.17.4) would trigger something like: 

[3986931.032554] [drm:amdgpu_job_timedout] *ERROR* ring gfx timeout, signaled
seq=296571651, emitted seq=296571653
[3986931.032566] [drm:amdgpu_job_timedout] *ERROR* Process information: process
blender-3.3 pid 11711 thread blender-3.:cs0 pid 11747
[3986931.032571] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[3986931.291361] amdgpu 0000:03:00.0: amdgpu: BACO reset
[3986931.477687] amdgpu 0000:03:00.0: amdgpu: GPU reset succeeded, trying to
resume
[3986931.478926] [drm] PCIE GART of 256M enabled (table at 0x000000F400500000).
[3986931.478942] [drm] VRAM is lost due to GPU reset!
[3986933.521790] amdgpu:
                  failed to send message 200 ret is 0
[3986937.653740] amdgpu:
                  last message was failed ret is 0
[3986939.720118] amdgpu:
                  failed to send message 100 ret is 0
[3986941.821883] amdgpu:
                  last message was failed ret is 0
[3986941.825119] amdgpu: SMU Firmware start failed!
[3986941.825123] amdgpu: Failed to load SMU ucode.
[3986941.825125] amdgpu: fw load failed
[3986941.825126] amdgpu: smu firmware loading failed
[3986941.825129] [drm] Skip scheduling IBs!
[3986941.825131] [drm] Skip scheduling IBs!
[3986941.825140] [drm] Skip scheduling IBs!
[3986941.825145] [drm] Skip scheduling IBs!
[3986941.825160] amdgpu 0000:03:00.0: amdgpu: GPU reset(2) failed

and a large amount of other messages, many repeating, including: 

[3986941.825198] [drm] Skip scheduling IBs!

[3986941.825315] [drm:amdgpu_cs_ioctl] *ERROR* Failed to initialize parser
-125!

and

[3986963.354537] amdgpu:
                  failed to send message 200 ret is 0
[3986965.419403] amdgpu:
                  last message was failed ret is 0

In all of these cases `cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover` will
generally either return some vague vague number, -11 IIRC, or will just freeze,
as has happened with my most recent SteamVR-induced freeze today. If frozen,
cat cannot be killed with signal 2 or signal 9, as it will be in a d state.
Furthermore, X and sometimes vrcompositor cannot be killed with it either as
they will also be in d states. Shutdown will do nothing, so the only way to
reboot is to manually shut down all processes then use the magic sysrq sequence
to shut down everything. In one instance even that didn't work and I needed to
hold the power key. SSH works fine as in other bugs related to ring GFX
timeout. 

The subject of this bug isn't the ring GFX timeout itself, but the fact that
the GPU reset never works. It's worked a couple of times (after manually
restarting X from SSH of course), but 90% of the time it hangs and does not
reset successfully. 

I will attach several dmesg (with color) logs from the several times this has
happened in the past couple months, but if needed I have logs going all the way
back to February 2020 as well.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.