Re: [PATCH] drm/sched: Check scheduler work queue before calling timeout handling

Luben Tuikov <luben.tuikov@xxxxxxx> · Wed, 10 May 2023 06:40:49 -0400

On 2023-05-09 17:43, vitaly.prosyak@xxxxxxx wrote:
> From: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
> 
> During an IGT GPU reset test we see again oops despite of
> commit 0c8c901aaaebc9bf8bf189ffc116e678f7a2dc16
> drm/sched: Check scheduler ready before calling timeout handling.

You can probably use the more succinct fixes line:
0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")

> 
> It uses ready condition whether to call drm_sched_fault which unwind
> the TDR leads to GPU reset.
> However it looks the ready condition is overloaded with other meanings,
> for example, for the following stack is related GPU reset :
> 
> 0  gfx_v9_0_cp_gfx_start
> 1  gfx_v9_0_cp_gfx_resume
> 2  gfx_v9_0_cp_resume
> 3  gfx_v9_0_hw_init
> 4  gfx_v9_0_resume
> 5  amdgpu_device_ip_resume_phase2
> 
> does the following:
> 	/* start the ring */
> 	gfx_v9_0_cp_gfx_start(adev);
> 	ring->sched.ready = true;
> 
> The same approach is for other ASICs as well :
> gfx_v8_0_cp_gfx_resume
> gfx_v10_0_kiq_resume, etc...
> 
> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
> and then drm_sched_fault. However now it depends on whether the interrupt service routine
> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
> field of the scheduler to true even  for not initialized schedulers and causes oops vs

"not initialized" --> "uninitialized" reads better.

> no fault or when ISR  drm_sched_fault is completed prior  gfx_v9_0_cp_gfx_start and
> NULL pointer dereference does not occur.
> 
> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
> The field could be initialized by the work queue of resetting the domain.
> 
> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>

Add, a fixes tag,

Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")

Before the SOB tag.

> ---
>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index 649fac2e1ccb..670b7997f389 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>   */
>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>  {
> -	if (sched->ready)
> +	if (sched->timeout_wq)
>  		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>  }
>  EXPORT_SYMBOL(drm_sched_fault);

Yes, this does indeed seem more correct.

Apply the comments above and repost the patch to amd-gfx and dri-devel and
I'll push it to drm-misc-fixes and amd-staging-drm-next.
-- 
Regards,
Luben