Re: [PATCH 1/5] drm/amdgpu: Introduce gfx software ring (v8)

Michel Dänzer <michel@xxxxxxxxxxx> · Mon, 31 Oct 2022 12:58:27 +0100

On 2022-10-31 09:10, Zhu, Jiadong wrote:
> Hi Michel,
> 
> Sorry for the late response. The NULL pointer dereference is more likely raised from the function amdgpu_ring_preempt_ib, as the preempt_ib hook is not assigned.

That makes sense, since amdgpu_mcbp_trigger_preempt passes mux->real_ring to amdgpu_ring_preempt_ib, and the real ring doesn't have the preempt_ib hook set, does it?
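For illustration, here is a minimal userspace sketch of that failure mode. The structures are simplified, hypothetical stand-ins, not the actual amdgpu definitions, and ring_preempt_ib_checked is a made-up helper name. The assumption is that the hook is called through a per-ring function table; calling a NULL entry would jump to address 0, which would match the "RIP: 0010:0x0" instruction fetch in the oops. A guard like this would turn that into an error return instead:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct ring;

/* Hypothetical stand-in for a per-ring function table. */
struct ring_funcs {
	int (*preempt_ib)(struct ring *ring);	/* optional hook, may be NULL */
};

struct ring {
	const struct ring_funcs *funcs;
	int preempt_requested;
};

/* Example hook, as a software ring might provide it. */
static int sw_preempt_ib(struct ring *ring)
{
	ring->preempt_requested = 1;
	return 0;
}

/* Guarded call: return -EINVAL instead of calling through a NULL hook. */
int ring_preempt_ib_checked(struct ring *ring)
{
	if (!ring || !ring->funcs || !ring->funcs->preempt_ib)
		return -EINVAL;
	return ring->funcs->preempt_ib(ring);
}

/* Small self-check: one ring without the hook, one with it. */
void preempt_hook_demo(void)
{
	static const struct ring_funcs no_hook = { .preempt_ib = NULL };
	static const struct ring_funcs with_hook = { .preempt_ib = sw_preempt_ib };
	struct ring real_ring = { .funcs = &no_hook };
	struct ring sw_ring = { .funcs = &with_hook };

	assert(ring_preempt_ib_checked(&real_ring) == -EINVAL);
	assert(ring_preempt_ib_checked(&sw_ring) == 0);
	assert(sw_ring.preempt_requested == 1);
}
```

Whether the right fix is a guard like this, or setting the hook on the real ring, is a separate question for the patch itself.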

> Btw, which kmd branch are you cherry-picking onto? Maybe my code base is too old.
> I tried using Wayland on Ubuntu 20.04 and could not reproduce the crash.

The Mesa radeonsi driver in Ubuntu 20.04 didn't support the EGL_IMG_context_priority extension yet. Does eglinfo list that extension as supported by the EGL Device platform on your system?
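If you would rather check programmatically than read the eglinfo output, the extension list is a space-separated string (what eglQueryString returns for EGL_EXTENSIONS). A rough sketch of a whole-token lookup follows; has_extension is a hypothetical helper name, and the string is passed in so the snippet stays independent of the EGL headers:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `name` appears as a complete, space-delimited token in
 * `extensions`. A bare strstr() could also match a longer extension name
 * that merely contains `name`.
 */
bool has_extension(const char *extensions, const char *name)
{
	size_t len;
	const char *p;

	if (!extensions || !name || !*name)
		return false;

	len = strlen(name);
	for (p = strstr(extensions, name); p; p = strstr(p + 1, name)) {
		bool starts = (p == extensions) || (p[-1] == ' ');
		bool ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return true;
	}
	return false;
}

/* Small self-check with a made-up extension string. */
void extension_demo(void)
{
	const char *exts = "EGL_KHR_image EGL_IMG_context_priority";

	assert(has_extension(exts, "EGL_IMG_context_priority"));
	assert(!has_extension(exts, "EGL_IMG_context"));
}
```

In a real program the first argument would come from the EGL display you are interested in, here the one for the EGL Device platform.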

> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> Sent: Thursday, October 20, 2022 10:59 PM
> To: Michel Dänzer <michel@xxxxxxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>; Tuikov, Luben <Luben.Tuikov@xxxxxxx>
> Subject: Re: [PATCH 1/5] drm/amdgpu: Introduce gfx software ring (v8)
> 
> On 20.10.22 at 16:49, Michel Dänzer wrote:
>> On 2022-10-18 11:08, jiadong.zhu@xxxxxxx wrote:
>>> From: "Jiadong.Zhu" <Jiadong.Zhu@xxxxxxx>
>>>
>>> The software ring is created to support priority context while there
>>> is only one hardware queue for gfx.
>>>
>>> Every software ring has its fence driver and could be used as an
>>> ordinary ring for the GPU scheduler.
>>> Multiple software rings are bound to a real ring with the ring muxer.
>>> The packets committed on the software ring are copied to the real ring.
>>>
>>> v2: Use array to store software ring entry.
>>> v3: Remove unnecessary prints.
>>> v4: Remove amdgpu_ring_sw_init/fini functions, using gtt for sw ring
>>> buffer for later dma copy optimization.
>>> v5: Allocate ring entry dynamically in the muxer.
>>> v6: Update comments for the ring muxer.
>>> v7: Modify for function naming.
>>> v8: Combine software ring functions into amdgpu_ring_mux.c
>> I tested patches 1-4 of this series (since Christian clearly nacked patch 5). I hit the oops below.
> 
> As long as you don't try to reset the GPU you can also test patch 5.
> It's just that we can't upstream the code like this, or things would break immediately.
> 
> Regards,
> Christian.
> 
>>
>> amdgpu_sw_ring_ib_begin+0x70/0x80 is in amdgpu_mcbp_trigger_preempt according to scripts/faddr2line, specifically line 376:
>>
>>       spin_unlock(&mux->lock);
>>
>> though I'm not sure why that would crash.
>>
>>
>> Are you not able to reproduce this with a GNOME Wayland session?
>>
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>> #PF: supervisor instruction fetch in kernel mode
>> #PF: error_code(0x0010) - not-present page
>> PGD 0 P4D 0
>> Oops: 0010 [#1] PREEMPT SMP NOPTI
>> CPU: 7 PID: 281 Comm: gfx_high Tainted: G            E      6.0.2+ #1
>> Hardware name: LENOVO 20NF0000GE/20NF0000GE, BIOS R11ET36W (1.16 ) 03/30/2020
>> RIP: 0010:0x0
>> Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
>> RSP: 0018:ffffbd594073bdc8 EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: ffff993c4a620000 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff993c4a62a350
>> RBP: ffff993c4a62d530 R08: 0000000000000000 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000114
>> R13: ffff993c4a620000 R14: 0000000000000000 R15: ffff993c4a62d128
>> FS:  0000000000000000(0000) GS:ffff993ef0bc0000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: ffffffffffffffd6 CR3: 00000001959fc000 CR4: 00000000003506e0
>> Call Trace:
>>   <TASK>
>>   amdgpu_sw_ring_ib_begin+0x70/0x80 [amdgpu]
>>   amdgpu_ib_schedule+0x15f/0x5d0 [amdgpu]
>>   amdgpu_job_run+0x102/0x1c0 [amdgpu]
>>   drm_sched_main+0x19a/0x440 [gpu_sched]
>>   ? dequeue_task_stop+0x70/0x70
>>   ? drm_sched_resubmit_jobs+0x10/0x10 [gpu_sched]
>>   kthread+0xe9/0x110
>>   ? kthread_complete_and_exit+0x20/0x20
>>   ret_from_fork+0x22/0x30
>>   </TASK>
>> [...]
>> note: gfx_high[281] exited with preempt_count 1
>> [...]
>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=14864, emitted seq=14865
>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox.dpkg-di pid 3540 thread firefox:cs0 pid 4666
>> amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
>>
>>
> 

-- 
Earthling Michel Dänzer            |                  https://redhat.com
Libre software enthusiast          |         Mesa and Xwayland developer