On 2022-10-31 09:10, Zhu, Jiadong wrote:
> [AMD Official Use Only - General]
>
> Hi Michel,
>
> Sorry for the late response. It is more likely the null pointer is raised from function amdgpu_ring_preempt_ib as preempt_ib is not assigned.

That makes sense, since amdgpu_mcbp_trigger_preempt passes mux->real_ring to amdgpu_ring_preempt_ib, and the real ring doesn't have the preempt_ib hook set, does it?


> Btw, Which branch of kmd are you cherry-pick? Maybe my code base is too old.
> I tried using wayland on ubuntu 20.04 and not reproduced the crash.

The Mesa radeonsi driver in Ubuntu 20.04 didn't support the EGL_IMG_context_priority extension yet. Does eglinfo list that extension as supported by the EGL Device platform on your system?


> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@xxxxxxx>
> Sent: Thursday, October 20, 2022 10:59 PM
> To: Michel Dänzer <michel@xxxxxxxxxxx>; Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Huang, Ray <Ray.Huang@xxxxxxx>; Tuikov, Luben <Luben.Tuikov@xxxxxxx>
> Subject: Re: [PATCH 1/5] drm/amdgpu: Introduce gfx software ring (v8)
>
> Am 20.10.22 um 16:49 schrieb Michel Dänzer:
>> On 2022-10-18 11:08, jiadong.zhu@xxxxxxx wrote:
>>> From: "Jiadong.Zhu" <Jiadong.Zhu@xxxxxxx>
>>>
>>> The software ring is created to support priority context while there
>>> is only one hardware queue for gfx.
>>>
>>> Every software ring has its fence driver and could be used as an
>>> ordinary ring for the GPU scheduler.
>>> Multiple software rings are bound to a real ring with the ring muxer.
>>> The packages committed on the software ring are copied to the real ring.
>>>
>>> v2: Use array to store software ring entry.
>>> v3: Remove unnecessary prints.
>>> v4: Remove amdgpu_ring_sw_init/fini functions, using gtt for sw ring
>>> buffer for later dma copy optimization.
>>> v5: Allocate ring entry dynamically in the muxer.
>>> v6: Update comments for the ring muxer.
>>> v7: Modify for function naming.
>>> v8: Combine software ring functions into amdgpu_ring_mux.c
>> I tested patches 1-4 of this series (since Christian clearly nacked patch 5). I hit the oops below.
>
> As long as you don't try to reset the GPU you can also test patch 5.
> It's just that we can't upstream the stuff like this or that would break immediately.
>
> Regards,
> Christian.
>
>>
>> amdgpu_sw_ring_ib_begin+0x70/0x80 is in amdgpu_mcbp_trigger_preempt according to scripts/faddr2line, specifically line 376:
>>
>> spin_unlock(&mux->lock);
>>
>> though I'm not sure why that would crash.
>>
>>
>> Are you not able to reproduce this with a GNOME Wayland session?
>>
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>> #PF: supervisor instruction fetch in kernel mode
>> #PF: error_code(0x0010) - not-present page
>> PGD 0 P4D 0
>> Oops: 0010 [#1] PREEMPT SMP NOPTI
>> CPU: 7 PID: 281 Comm: gfx_high Tainted: G E 6.0.2+ #1
>> Hardware name: LENOVO 20NF0000GE/20NF0000GE, BIOS R11ET36W (1.16 ) 03/30/2020
>> RIP: 0010:0x0
>> Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
>> RSP: 0018:ffffbd594073bdc8 EFLAGS: 00010246
>> RAX: 0000000000000000 RBX: ffff993c4a620000 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff993c4a62a350
>> RBP: ffff993c4a62d530 R08: 0000000000000000 R09: 0000000000000000
>> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000114
>> R13: ffff993c4a620000 R14: 0000000000000000 R15: ffff993c4a62d128
>> FS: 0000000000000000(0000) GS:ffff993ef0bc0000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: ffffffffffffffd6 CR3: 00000001959fc000 CR4: 00000000003506e0
>> Call Trace:
>> <TASK>
>> amdgpu_sw_ring_ib_begin+0x70/0x80 [amdgpu]
>> amdgpu_ib_schedule+0x15f/0x5d0 [amdgpu]
>> amdgpu_job_run+0x102/0x1c0 [amdgpu]
>> drm_sched_main+0x19a/0x440 [gpu_sched]
>> ? dequeue_task_stop+0x70/0x70
>> ? drm_sched_resubmit_jobs+0x10/0x10 [gpu_sched]
>> kthread+0xe9/0x110
>> ? kthread_complete_and_exit+0x20/0x20
>> ret_from_fork+0x22/0x30
>> </TASK>
>> [...]
>> note: gfx_high[281] exited with preempt_count 1
>> [...]
>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=14864, emitted seq=14865
>> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox.dpkg-di pid 3540 thread firefox:cs0 pid 4666
>> amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
>>
>>
>

-- 
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer