https://bugzilla.kernel.org/show_bug.cgi?id=219118 Bug ID: 219118 Summary: Linux 6.10.x [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout VM fault / GPU fault detected Product: Drivers Version: 2.5 Hardware: All OS: Linux Status: NEW Severity: high Priority: P3 Component: Video(DRI - non Intel) Assignee: drivers_video-dri@xxxxxxxxxxxxxxxxxxxx Reporter: mjevans1983@xxxxxxxxx Regression: No I'm not sure if this should be filed under Console/Framebuffers, Video(DRI - non Intel), or Video(Other). I thought I'd created the bug in the correct location, https://gitlab.freedesktop.org/drm/amd/-/issues/3510 but no maintainer has commented or otherwise notably interacted with the report. Initially I thought it was just an MPV bug since VLC didn't trigger the issue https://github.com/mpv-player/mpv/issues/14600 . It looks like a developer's personal(?) drm-fixes-6.11 branch cherry picked the commit that appeared to fix the issue completely for my test cases: https://gitlab.freedesktop.org/agd5f/linux/-/commit/f3572db3c049b4d32bb5ba77ad5305616c44c7c1 However that isn't for the earlier 6.10.x series which also needs the fix, unless it's dead. This appears to be a Swiss cheese sort of bug situation. If software requests/provides contiguous buffers then the error results are more subtle, such as momentary video corruption if the kernel's access isn't out of bounds but rather rarely scrambled. It's only when both the userspace and driver don't enforce contiguous buffer segments that out of bounds accesses result in a GPU reset and consequently terminated userspace. ArchLinux (rolling release) Linux 6.10.1-arch1-1 #1 (closed) SMP PREEMPT_DYNAMIC Wed, 24 Jul 2024 22:25:43 +0000 x86_64 GNU/Linux amdgpu + OpenGL version string: 4.6 (Compatibility Profile) Mesa 24.1.4-arch1.2 ArchLinux current stable builds [ 1766.321165] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0a22c802 [ 1766.321171] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 [ 1766.321172] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101F44 [ 1766.321174] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0C8002 [ 1766.321175] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1056580, write from 'TC3' (0x54433300) (200) [ 1766.321237] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07f2a002 [ 1766.321238] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 [ 1766.321239] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010120C [ 1766.321240] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002 [ 1766.321241] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053196, write from 'CB2' (0x43423200) (32) [ 1766.321244] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b29002 [ 1766.321245] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 [ 1766.321247] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101237 [ 1766.321247] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B010002 [ 1766.321248] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053239, write from 'CB3' (0x43423300) (16) [ 1766.321255] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772e002 [ 1766.321256] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 [ 1766.321257] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101200 [ 1766.321258] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002 [ 1766.321258] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053184, write from 'CB4' (0x43423400) (160) [ 1766.321262] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772d002 [ 1766.321263] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 [ 1766.321264] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101232 [ 1766.321264] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002 [ 1766.321265] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053234, write from 'CB4' (0x43423400) (160) [ 1766.321268] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07729002 [ 1766.321269] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 [ 1766.321271] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010123A [ 1766.321271] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002 [ 1766.321272] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053242, write from 'CB1' (0x43423100) (80) [ 1766.321275] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732d002 [ 1766.321276] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 [ 1766.321277] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001012AB [ 1766.321278] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002 [ 1766.321279] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053355, write from 'CB2' (0x43423200) (32) [ 1766.321282] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07126002 [ 1766.321283] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 [ 1766.321284] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010124C [ 1766.321285] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0E0002 [ 1766.321286] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053260, write from 'CB6' (0x43423600) (224) [ 1766.321289] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b21002 [ 1766.321290] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 [ 1766.321291] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101223 [ 1766.321292] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002 [ 1766.321293] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80) [ 1766.321296] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002 [ 1766.321297] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 [ 1766.321298] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101277 [ 1766.321298] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002 [ 1766.321299] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208) [ 1777.234990] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816 [ 1777.236251] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80) Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002 Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101277 Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002 Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208) Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816 Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007 Jul 25 22:09:21 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin! Jul 25 22:09:21 HOSTNAME kernel: amdgpu: cp is busy, skip halt cp Jul 25 22:09:22 HOSTNAME kernel: amdgpu: rlc is busy, skip halt rlc Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume Jul 25 22:09:22 HOSTNAME kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400800000). Jul 25 22:09:22 HOSTNAME kernel: [drm] VRAM is lost due to GPU reset! Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110) Jul 25 22:09:22 HOSTNAME kernel: [drm] UVD initialized successfully. Jul 25 22:09:22 HOSTNAME kernel: [drm] VCE initialized successfully. Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start Jul 25 22:09:22 HOSTNAME mpv[5307]: amdgpu: The CS has cancelled because the context is lost. This context is innocent. Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) succeeded! Jul 25 22:09:22 HOSTNAME systemd-coredump[5681]: Process 5307 (mpv) of user 1000 terminated abnormally with signal 6/ABRT, processing... Jul 25 22:09:22 HOSTNAME systemd[1]: Created slice Slice /system/drkonqi-coredump-processor. -- Subject: A start job for unit system-drkonqi\x2dcoredump\x2dprocessor.slice has finished successfully -- You may reply to this email to add a comment. You are receiving this mail because: You are watching the assignee of the bug.