https://bugzilla.kernel.org/show_bug.cgi?id=212739

            Bug ID: 212739
           Summary: [amdgpu] Sporadic GPU errors, screen artifacts and
                    GPU-induced system lockups on Vega 10 (Raven Ridge)
           Product: Drivers
           Version: 2.5
    Kernel Version: 5.11.14-1, 5.12.rc7.d0411.gd434405-1
          Hardware: x86-64
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@xxxxxxxxxxxxxxxxxxxx
          Reporter: tunas@xxxxxxxxxxxxx
        Regression: No

Created attachment 296449
  --> https://bugzilla.kernel.org/attachment.cgi?id=296449&action=edit
Example of GPU artifacts from the recoverable variant of this error

From time to time, the amdgpu driver reports a page fault (sometimes attributed
to pid 0, sometimes to the web browser, the screen compositor, Xorg, a video
player, etc.), as shown below:

>kernel: amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:4 pasid:0, for process pid 0 thread pid 0)
>kernel: amdgpu 0000:05:00.0: amdgpu: in page starting at address 0x800101606000 from client 27
>kernel: amdgpu 0000:05:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00401031
>kernel: amdgpu 0000:05:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
>kernel: amdgpu 0000:05:00.0: amdgpu: MORE_FAULTS: 0x1
>kernel: amdgpu 0000:05:00.0: amdgpu: WALKER_ERROR: 0x0
>kernel: amdgpu 0000:05:00.0: amdgpu: PERMISSION_FAULTS: 0x3
>kernel: amdgpu 0000:05:00.0: amdgpu: MAPPING_ERROR: 0x0
>kernel: amdgpu 0000:05:00.0: amdgpu: RW: 0x0

This message is repeated several thousand times in dmesg ("x callbacks
suppressed") with different addresses of the form 0x80010160Y000, where Y is a
hex digit between 1 and 8. While this is happening, the display is completely
frozen: input still goes through and music keeps playing, but the screen is
static.

Several seconds later, the page faults are followed by:

>kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
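As an aside for anyone comparing fault logs: the individual fields the driver
prints can be recovered from the raw VM_L2_PROTECTION_FAULT_STATUS value. The
sketch below is only illustrative. The bit layout in FIELDS is an assumption
(it is not stated anywhere in this report), and decode_fault_status is a
hypothetical helper name; the layout does at least reproduce the field values
and the vmid printed in the page fault excerpt above.

```python
# Assumed bit layout of VM_L2_PROTECTION_FAULT_STATUS: (field, low bit, width).
# This table is an assumption for illustration; it is only checked against
# the values the driver printed in the log excerpt of this report.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("CID", 9, 9),    # UTCL2 client ID (printed as "TCP (0x8)" in the log)
    ("RW", 18, 1),
    ("VMID", 20, 4),
]


def decode_fault_status(status: int) -> dict:
    """Split a raw status value into named fields using the assumed layout."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


# Raw value taken from the log excerpt in this report.
decoded = decode_fault_status(0x00401031)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

Running this on a status value from another fault makes it easy to check
whether the permission bits or the client ID differ between occurrences.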
Finally, the driver reports:

>[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

After this, the computer resumes operation, but GPU artifacts remain on the
screen (see the attached screenshot for an example).

Alternatively, instead of the soft recovery message, the GPU sometimes fails to
recover and the following messages appear in the kernel log:

>kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
>kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=3356413, emitted seq=3356415
>kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 14524 thread Xorg:cs0 pid 14539
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
>kernel: [drm] free PSP TMR buffer
>kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
>kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
>kernel: [drm] PSP is resuming...
>kernel: [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
>kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
>kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
>kernel: [drm] kiq ring mec 2 pipe 1 q 0
>kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring sdma0 test failed (-110)
>kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <sdma_v4_0> failed -110
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(4) failed
>kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110

At this point a reboot is necessary, as the GPU will not resume operation.

This also happens on the latest 5.12 release candidate (rc7 at the time of
writing this bug report).

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel