[Bug 219118] New: Linux 6.10.x [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout VM fault / GPU fault detected

https://bugzilla.kernel.org/show_bug.cgi?id=219118

            Bug ID: 219118
           Summary: Linux 6.10.x [drm:amdgpu_job_timedout [amdgpu]]
                    *ERROR* ring gfx timeout VM fault / GPU fault detected
           Product: Drivers
           Version: 2.5
          Hardware: All
                OS: Linux
            Status: NEW
          Severity: high
          Priority: P3
         Component: Video(DRI - non Intel)
          Assignee: drivers_video-dri@xxxxxxxxxxxxxxxxxxxx
          Reporter: mjevans1983@xxxxxxxxx
        Regression: No

I'm not sure if this should be filed under Console/Framebuffers, Video(DRI -
non Intel), or Video(Other).

I originally filed this at
https://gitlab.freedesktop.org/drm/amd/-/issues/3510 , which I thought was the
correct location, but no maintainer has commented on or otherwise notably
interacted with the report.  Initially I thought it was just an mpv bug, since
VLC didn't trigger the issue: https://github.com/mpv-player/mpv/issues/14600 .

It looks like a developer's personal(?) drm-fixes-6.11 branch cherry-picked the
commit that appeared to fix the issue completely for my test cases:
https://gitlab.freedesktop.org/agd5f/linux/-/commit/f3572db3c049b4d32bb5ba77ad5305616c44c7c1

However, that branch doesn't cover the earlier 6.10.x series, which also needs
the fix unless that series is already dead.

This appears to be a Swiss cheese sort of bug situation.  If software
requests/provides contiguous buffers, then the errors are more subtle, such as
momentary video corruption, because the kernel's accesses aren't out of bounds
but are occasionally scrambled.  It's only when neither userspace nor the
driver enforces contiguous buffer segments that out-of-bounds accesses result
in a GPU reset and consequently terminated userspace processes.


ArchLinux (rolling release)
Linux 6.10.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 24 Jul 2024 22:25:43 +0000 x86_64 GNU/Linux
amdgpu + OpenGL version string: 4.6 (Compatibility Profile) Mesa 24.1.4-arch1.2
ArchLinux current stable builds


[ 1766.321165] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0a22c802
[ 1766.321171] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321172] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101F44
[ 1766.321174] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0C8002
[ 1766.321175] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1056580, write from 'TC3' (0x54433300) (200)
[ 1766.321237] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07f2a002
[ 1766.321238] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321239] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010120C
[ 1766.321240] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002
[ 1766.321241] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053196, write from 'CB2' (0x43423200) (32)
[ 1766.321244] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b29002
[ 1766.321245] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321247] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101237
[ 1766.321247] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B010002
[ 1766.321248] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053239, write from 'CB3' (0x43423300) (16)
[ 1766.321255] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772e002
[ 1766.321256] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321257] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101200
[ 1766.321258] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002
[ 1766.321258] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053184, write from 'CB4' (0x43423400) (160)
[ 1766.321262] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772d002
[ 1766.321263] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321264] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101232
[ 1766.321264] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002
[ 1766.321265] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053234, write from 'CB4' (0x43423400) (160)
[ 1766.321268] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07729002
[ 1766.321269] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321271] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010123A
[ 1766.321271] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002
[ 1766.321272] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053242, write from 'CB1' (0x43423100) (80)
[ 1766.321275] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732d002
[ 1766.321276] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321277] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001012AB
[ 1766.321278] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002
[ 1766.321279] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053355, write from 'CB2' (0x43423200) (32)
[ 1766.321282] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07126002
[ 1766.321283] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321284] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010124C
[ 1766.321285] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0E0002
[ 1766.321286] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053260, write from 'CB6' (0x43423600) (224)
[ 1766.321289] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b21002
[ 1766.321290] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321291] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101223
[ 1766.321292] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002
[ 1766.321293] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80)
[ 1766.321296] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002
[ 1766.321297] amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321298] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101277
[ 1766.321298] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002
[ 1766.321299] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)
[ 1777.234990] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816
[ 1777.236251] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
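
For anyone reading the records above, here is a minimal Python sketch of how the numbers in one fault record appear to relate to each other. It only restates what is visible in this log: the decimal "page" value matches the hex VM_CONTEXT1_PROTECTION_FAULT_ADDR value, and the 32-bit word printed after the client name reads as ASCII padded with a NUL byte. The helper names (decode_client, summarize) are made up for illustration, and treating the difference between emitted and signaled seq as the number of unfinished submissions is my assumption, not something the driver states.

```python
import re

# Three lines copied from the dmesg excerpt above (e-mail line wraps joined).
LOG = """\
[ 1766.321172] amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101F44
[ 1766.321175] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1056580, write from 'TC3' (0x54433300) (200)
[ 1777.234990] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816
"""

ADDR_RE = re.compile(r"PROTECTION_FAULT_ADDR\s+(0x[0-9A-Fa-f]+)")
FAULT_RE = re.compile(r"at page (\d+), (read|write) from '(\w+)' \((0x[0-9A-Fa-f]+)\)")
SEQ_RE = re.compile(r"signaled seq=(\d+), emitted seq=(\d+)")


def decode_client(raw_hex):
    """Read the 32-bit client word as big-endian ASCII, dropping NUL padding."""
    return int(raw_hex, 16).to_bytes(4, "big").rstrip(b"\x00").decode("ascii")


def summarize(log):
    """Pull the first fault address, fault record, and timeout counters from a log."""
    addr = int(ADDR_RE.search(log).group(1), 16)
    page, access, name, raw = FAULT_RE.search(log).groups()
    signaled, emitted = map(int, SEQ_RE.search(log).groups())
    return {
        "fault_addr_register": addr,          # hex register value as an integer
        "page": int(page),                    # decimal page from the "VM fault" line
        "access": access,
        "client_name": name,
        "client_decoded": decode_client(raw),
        # Assumption: emitted minus signaled = submissions not yet completed.
        "pending_fences": emitted - signaled,
    }


summary = summarize(LOG)
print(summary)
```

Running the same helpers over the other records gives a quick way to check that each faulting page matches its address register and to see which clients (TC3, CB1 through CB7) were writing at the time.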

Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80)
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu:  for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101277
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)
Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816
Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:21 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Jul 25 22:09:21 HOSTNAME kernel: amdgpu: cp is busy, skip halt cp
Jul 25 22:09:22 HOSTNAME kernel: amdgpu: rlc is busy, skip halt rlc
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Jul 25 22:09:22 HOSTNAME kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400800000).
Jul 25 22:09:22 HOSTNAME kernel: [drm] VRAM is lost due to GPU reset!
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
Jul 25 22:09:22 HOSTNAME kernel: [drm] UVD initialized successfully.
Jul 25 22:09:22 HOSTNAME kernel: [drm] VCE initialized successfully.
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
Jul 25 22:09:22 HOSTNAME mpv[5307]: amdgpu: The CS has cancelled because the context is lost. This context is innocent.
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) succeeded!
Jul 25 22:09:22 HOSTNAME systemd-coredump[5681]: Process 5307 (mpv) of user 1000 terminated abnormally with signal 6/ABRT, processing...
Jul 25 22:09:22 HOSTNAME systemd[1]: Created slice Slice /system/drkonqi-coredump-processor.
-- Subject: A start job for unit system-drkonqi\x2dcoredump\x2dprocessor.slice has finished successfully



