Re: [PATCH 1/5] drm/amdgpu: Introduce gfx software ring (v8)

Michel Dänzer <michel@xxxxxxxxxxx> · Thu, 20 Oct 2022 16:49:12 +0200

On 2022-10-18 11:08, jiadong.zhu@xxxxxxx wrote:
> From: "Jiadong.Zhu" <Jiadong.Zhu@xxxxxxx>
> 
> The software ring is created to support priority context while there is only
> one hardware queue for gfx.
> 
> Every software ring has its fence driver and could be used as an ordinary ring
> for the GPU scheduler.
> Multiple software rings are bound to a real ring with the ring muxer. The
> packages committed on the software ring are copied to the real ring.
> 
> v2: Use array to store software ring entry.
> v3: Remove unnecessary prints.
> v4: Remove amdgpu_ring_sw_init/fini functions,
> using gtt for sw ring buffer for later dma copy
> optimization.
> v5: Allocate ring entry dynamically in the muxer.
> v6: Update comments for the ring muxer.
> v7: Modify for function naming.
> v8: Combine software ring functions into amdgpu_ring_mux.c

I tested patches 1-4 of this series (since Christian clearly nacked patch 5). I hit the oops below.

amdgpu_sw_ring_ib_begin+0x70/0x80 is in amdgpu_mcbp_trigger_preempt according to scripts/faddr2line, specifically line 376:

	spin_unlock(&mux->lock);

though I'm not sure why that would crash.
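
One possible reading of the dump below, offered as an assumption rather than a diagnosis: "RIP: 0010:0x0" together with "supervisor instruction fetch" and error_code(0x0010) means the CPU tried to execute code at address 0, which is what an indirect call through a NULL function pointer produces. The call trace entry is a return address, so the line it maps to can be the statement right after the faulting call rather than the statement that faulted. The sketch below is plain userspace C with hypothetical names (ring_ops, preempt, ring_try_preempt); it is not the amdgpu code and only illustrates guarding an optional callback before calling through it.

```c
#include <stddef.h>

/* Hypothetical ops table with an optional hook; not the amdgpu structures. */
struct ring_ops {
	void (*preempt)(void);	/* may be NULL if the backend lacks it */
};

/* Counter so a caller can observe whether the hook ran. */
int calls;

/* Example hook implementation used for illustration. */
void count_call(void)
{
	calls++;
}

/*
 * Call the optional hook only if it is set.
 * Returns 0 if the hook ran, -1 if it was missing.
 * Without the check, calling a NULL hook jumps to address 0.
 */
int ring_try_preempt(const struct ring_ops *ops)
{
	if (!ops || !ops->preempt)
		return -1;

	ops->preempt();
	return 0;
}
```

With a table whose preempt member is NULL, ring_try_preempt() returns -1 instead of jumping to address 0; with count_call installed it returns 0 and increments calls.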

Are you not able to reproduce this with a GNOME Wayland session?

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 0 P4D 0
Oops: 0010 [#1] PREEMPT SMP NOPTI
CPU: 7 PID: 281 Comm: gfx_high Tainted: G            E      6.0.2+ #1
Hardware name: LENOVO 20NF0000GE/20NF0000GE, BIOS R11ET36W (1.16 ) 03/30/2020
RIP: 0010:0x0
Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
RSP: 0018:ffffbd594073bdc8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff993c4a620000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff993c4a62a350
RBP: ffff993c4a62d530 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000114
R13: ffff993c4a620000 R14: 0000000000000000 R15: ffff993c4a62d128
FS:  0000000000000000(0000) GS:ffff993ef0bc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000001959fc000 CR4: 00000000003506e0
Call Trace:
 <TASK>
 amdgpu_sw_ring_ib_begin+0x70/0x80 [amdgpu]
 amdgpu_ib_schedule+0x15f/0x5d0 [amdgpu]
 amdgpu_job_run+0x102/0x1c0 [amdgpu]
 drm_sched_main+0x19a/0x440 [gpu_sched]
 ? dequeue_task_stop+0x70/0x70
 ? drm_sched_resubmit_jobs+0x10/0x10 [gpu_sched]
 kthread+0xe9/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x22/0x30
 </TASK>
[...]
note: gfx_high[281] exited with preempt_count 1
[...]
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=14864, emitted seq=14865
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox.dpkg-di pid 3540 thread firefox:cs0 pid 4666
amdgpu 0000:05:00.0: amdgpu: GPU reset begin!

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer