Re: [PATCH] drm/sched: Check scheduler work queue before calling timeout handling

Luben Tuikov <luben.tuikov@xxxxxxx> · Wed, 10 May 2023 10:31:36 -0400

On 2023-05-10 10:24, vitaly prosyak wrote:
> 
> On 2023-05-10 10:19, Luben Tuikov wrote:
>> On 2023-05-10 09:51, vitaly.prosyak@xxxxxxx wrote:
>>> From: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
>>>
>>> During an IGT GPU reset test we see again oops despite of
>>> commit 0c8c901aaaebc9 (drm/sched: Check scheduler ready before calling
>>> timeout handling).
>>>
>>> It uses ready condition whether to call drm_sched_fault which unwind
>>> the TDR leads to GPU reset.
>>> However it looks the ready condition is overloaded with other meanings,
>>> for example, for the following stack is related GPU reset :
>>>
>>> 0  gfx_v9_0_cp_gfx_start
>>> 1  gfx_v9_0_cp_gfx_resume
>>> 2  gfx_v9_0_cp_resume
>>> 3  gfx_v9_0_hw_init
>>> 4  gfx_v9_0_resume
>>> 5  amdgpu_device_ip_resume_phase2
>>>
>>> does the following:
>>> 	/* start the ring */
>>> 	gfx_v9_0_cp_gfx_start(adev);
>>> 	ring->sched.ready = true;
>>>
>>> The same approach is for other ASICs as well :
>>> gfx_v8_0_cp_gfx_resume
>>> gfx_v10_0_kiq_resume, etc...
>>>
>>> As a result, our GPU reset test causes GPU fault which calls unconditionally gfx_v9_0_fault
>>> and then drm_sched_fault. However now it depends on whether the interrupt service routine
>>> drm_sched_fault is executed after gfx_v9_0_cp_gfx_start is completed which sets the ready
>>> field of the scheduler to true even  for uninitialized schedulers and causes oops vs
>>> no fault or when ISR  drm_sched_fault is completed prior  gfx_v9_0_cp_gfx_start and
>>> NULL pointer dereference does not occur.
>>>
>>> Use the field timeout_wq  to prevent oops for uninitialized schedulers.
>>> The field could be initialized by the work queue of resetting the domain.
>>>
>>> Fixes: 0c8c901aaaebc9 ("drm/sched: Check scheduler ready before calling timeout handling")
>>>
>>> v1: Corrections to commit message (Luben)
>>> Signed-off-by: Vitaly Prosyak <vitaly.prosyak@xxxxxxx>
>>> Reviewed-by: Luben Tuikov <luben.tuikov@xxxxxxx>
>> I didn't give my RB to this patch so I'm not sure what it is doing here.
> I removed your rb, also if you do not know what is doing here why do you want to push this to amd-staging-drm-next and to  drm-misc-fixed?

I'll add my RB as I push it to those two branches.
I'll also add a Link tag and fix the commit SHA for the Fixes tag to
one which is found in drm-misc-fixes.

Thanks for the patch fixing this long-standing bug.

Regards,
Luben

>>
>> The fixes tag should be before the SOB tag, and the v1 line should be separated
>> by a line before the Git tags.
>>
>> Since this is a good patch and I want it in both drm-misc-fixed and amd-staging-drm-next,
>> I'll submit it to drm-misc-fixed with a Link: and RB/SOB tag there and then cherry-pick
>> that into amd-staging-drm-next.
>>
>> Don't push it to amd-staging-drm-next.
>>
>> I'll fix this and submit to amd-staging-drm-next and to drm-misc-fixed with
>> a Link: tag.
>>
>> Regards,
>> Luben
>>
>>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_main.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
>>> index 649fac2e1ccb..670b7997f389 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>> @@ -308,7 +308,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
>>>   */
>>>  void drm_sched_fault(struct drm_gpu_scheduler *sched)
>>>  {
>>> -	if (sched->ready)
>>> +	if (sched->timeout_wq)
>>>  		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
>>>  }
>>>  EXPORT_SYMBOL(drm_sched_fault);