On 2023-05-10 10:24, vitaly prosyak wrote: > > On 2023-05-10 10:19, Luben Tuikov wrote: >> On 2023-05-10 09:51, vitaly.prosyak@xxxxxxx wrote: >>> From: Vitaly Prosyak <vitaly.prosyak@xxxxxxx> >>> >>> During an IGT GPU reset test we see again oops despite of >>> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling >>> timeout handling). >>> >>> It uses ready condition whether to call drm_sched_fault which unwind >>> the TDR leads to GPU reset. >>> However it looks the ready condition is overloaded with other meanings, >>> for example, for the following stack is related GPU reset : >>> >>> 0 gfx_v9_0_cp_gfx_start >>> 1 gfx_v9_0_cp_gfx_resume >>> 2 gfx_v9_0_cp_resume >>> 3 gfx_v9_0_hw_init >>> 4 gfx_v9_0_resume >>> 5 amdgpu_device_ip_resume_phase2 >>> >>> does the following: >>> /* start the ring */ >>> gfx_v9_0_cp_gfx_start(adev); >>> ring->sched.ready = true; >>> >>> The same approach is for other ASICs as well : >>> gfx_v8_0_cp_gfx_resume >>> gfx_v10_0_kiq_resume, etc... >>> >>> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault >>> and then drm_sched_fault. However now it depends on whether the interrupt service routine >>> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready >>> field of the scheduler to true even for uninitialized schedulers and causes oops vs >>> no fault or when ISR drm_sched_fault is completed prior gfx_v9_0_cp_gfx_start and >>> NULL pointer dereference does not occur. >>> >>> Use the field timeout_wq to prevent oops for uninitialized schedulers. >>> The field could be initialized by the work queue of resetting the domain. >>> >>> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling") >>> >>> v1: Corrections to commit message (Luben) >>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx> >>> Reviewed-by: Luben Tuikov <luben.tuikov@xxxxxxx> >> I didn't give my RB to this patch so I'm not sure what it is doing here. > I removed your rb, also if you do not know what is doing here why do you want to push this to amd-staging-drm-next and to drm-misc-fixed? I'll add my RB as I push it to those two branches. I'll also add a Link tag and fix the commit SHA for the Fixes tag to one which is found in drm-misc-fixes. Thanks for the patch fixing this long-standing bug. Regards, Luben >> >> The fixes tag should be before the SOB tag, and the v1 line should be separated >> by a line before the Git tags. >> >> Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next, >> I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick >> that into amd-staging-drm-next. >> >> Don't push it to amd-staging-drm-next. >> >> I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with >> a Link: tag. >> >> Regards, >> Luben >> >> >>> --- >>> drivers/gpu/drm/scheduler/sched_main.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c >>> index 649fac2e1ccb..670b7997f389 100644 >>> --- a/drivers/gpu/drm/scheduler/sched_main.c >>> +++ b/drivers/gpu/drm/scheduler/sched_main.c >>> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched) >>> */ >>> void drm_sched_fault(struct drm_gpu_scheduler *sched) >>> { >>> - if (sched->ready) >>> + if (sched->timeout_wq) >>> mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0); >>> } >>> EXPORT_SYMBOL(drm_sched_fault);