On 2023-05-09 17:43, vitaly.prosyak@xxxxxxx wrote: > From: Vitaly Prosyak <vitaly.prosyak@xxxxxxx> > > During an IGT GPU reset test we see again oops despite of > commit 0c8c901aaaebc9bf8bf189ffc116e678f7a2dc16 > drm/sched: Check scheduler ready before calling timeout handling. You can probably use the more succinct fixes line: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling") > > It uses ready condition whether to call drm_sched_fault which unwind > the TDR leads to GPU reset. > However it looks the ready condition is overloaded with other meanings, > for example, for the following stack is related GPU reset : > > 0 gfx_v9_0_cp_gfx_start > 1 gfx_v9_0_cp_gfx_resume > 2 gfx_v9_0_cp_resume > 3 gfx_v9_0_hw_init > 4 gfx_v9_0_resume > 5 amdgpu_device_ip_resume_phase2 > > does the following: > /* start the ring */ > gfx_v9_0_cp_gfx_start(adev); > ring->sched.ready = true; > > The same approach is for other ASICs as well : > gfx_v8_0_cp_gfx_resume > gfx_v10_0_kiq_resume, etc... > > As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault > and then drm_sched_fault. However now it depends on whether the interrupt service routine > drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready > field of the scheduler to true even for not initialized schedulers and causes oops vs "not initialized" --> "uninitialized" reads better. > no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and > NULL pointer dereference does not occur. > > Use the field timeout_wq to prevent oops for uninitialized schedulers. > The field could be initialized by the work queue of resetting the domain. > > Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx> Add, a fixes tag, Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling") Before the SOB tag. > --- > drivers/gpu/drm/scheduler/sched_main.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c > index 649fac2e1ccb..670b7997f389 100644 > --- a/drivers/gpu/drm/scheduler/sched_main.c > +++ b/drivers/gpu/drm/scheduler/sched_main.c > @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched) > */ > void drm_sched_fault(struct drm_gpu_scheduler *sched) > { > - if (sched->ready) > + if (sched->timeout_wq) > mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0); > } > EXPORT_SYMBOL(drm_sched_fault); Yes, this does indeed seem more correct. Apply the comments above and repost the patch to amd-gfx and dri-devel and I'll push it to drm-misc-fixes and amd-staging-drm-next. -- Regards, Luben