On 2023-05-10 10:19, Luben Tuikov wrote: > On 2023-05-10 09:51, vitaly.prosyak@xxxxxxx wrote: >> From: Vitaly Prosyak <vitaly.prosyak@xxxxxxx> >> >> During an IGT GPU reset test we see again oops despite of >> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling >> timeout handling). >> >> It uses ready condition whether to call drm_sched_fault which unwind >> the TDR leads to GPU reset. >> However it looks the ready condition is overloaded with other meanings, >> for example, for the following stack is related GPU reset : >> >> 0 gfx_v9_0_cp_gfx_start >> 1 gfx_v9_0_cp_gfx_resume >> 2 gfx_v9_0_cp_resume >> 3 gfx_v9_0_hw_init >> 4 gfx_v9_0_resume >> 5 amdgpu_device_ip_resume_phase2 >> >> does the following: >> /* start the ring */ >> gfx_v9_0_cp_gfx_start(adev); >> ring->sched.ready = true; >> >> The same approach is for other ASICs as well : >> gfx_v8_0_cp_gfx_resume >> gfx_v10_0_kiq_resume, etc... >> >> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault >> and then drm_sched_fault. However now it depends on whether the interrupt service routine >> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready >> field of the scheduler to true even for uninitialized schedulers and causes oops vs >> no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and >> NULL pointer dereference does not occur. >> >> Use the field timeout_wq to prevent oops for uninitialized schedulers. >> The field could be initialized by the work queue of resetting the domain. >> >> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling") >> >> v1: Corrections to commit message (Luben) >> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx> >> Reviewed-by: Luben Tuikov <luben.tuikov@xxxxxxx> > I didn't give my RB to this patch so I'm not sure what it is doing here. I removed your rb, also if you do not know what is doing here why do you want to push this to amd-staging-drm-next and to drm-misc-fixed? > > The fixes tag should be before the SOB tag, and the v1 line should be separated > by a line before the Git tags. > > Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next, > I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick > that into amd-staging-drm-next. > > Don't push it to amd-staging-drm-next. > > I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with > a Link: tag. > > Regards, > Luben > > >> --- >> drivers/gpu/drm/scheduler/sched_main.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >> index 649fac2e1ccb..670b7997f389 100644 >> --- a/drivers/gpu/drm/scheduler/sched_main.c >> +++ b/drivers/gpu/drm/scheduler/sched_main.c >> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched) >> */ >> void drm_sched_fault(struct drm_gpu_scheduler *sched) >> { >> - if (sched->ready) >> + if (sched->timeout_wq) >> mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0); >> } >> EXPORT_SYMBOL(drm_sched_fault);