Re: [PATCH] drm/amdgpu: Fix two reset triggered in a row

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Tue, 23 Apr 2024 08:36:02 +0530

On 4/23/2024 1:15 AM, Yunxiang Li wrote:
> Reset request from KFD is missing a check for if a reset is already in
> progress, this causes a second reset to be triggered right after the
> previous one finishes. Add the check to align with the other reset sources.
> 
> Signed-off-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 3b4591f554f1..ce3dbb1cc2da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -283,7 +283,7 @@ int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
>  
>  void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)
>  {
> -	if (amdgpu_device_should_recover_gpu(adev))
> +	if (amdgpu_device_should_recover_gpu(adev) && !amdgpu_in_reset(adev))
>  		amdgpu_reset_domain_schedule(adev->reset_domain,
>  					     &adev->kfd.reset_work);

Technically, we can't do this: there are cases where we skip the full
device reset, and amdgpu_in_reset() still returns true in those cases.
A better approach is to move amdgpu_device_stop_pending_resets() later in
gpu_recover(): if the device has undergone a full reset, then cancel all
pending resets. At present the cancellation happens earlier, which could
be why this issue is seen.
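
To illustrate the ordering I have in mind, here is a rough userspace toy
model. It is only a sketch, not driver code: every toy_* name is made up
for illustration, and the real gpu_recover() path is considerably more
involved. The assumption is that a KFD reset request can arrive while a
recovery is still in progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the device; NOT the real struct amdgpu_device. */
struct toy_dev {
	bool in_reset;          /* what an amdgpu_in_reset()-style check would see */
	bool kfd_reset_pending; /* a KFD reset request queued on the reset domain */
	int full_resets;        /* number of full device resets performed */
};

/* Hypothetical helper: models KFD queueing its reset work. */
static void toy_kfd_request_reset(struct toy_dev *d)
{
	d->kfd_reset_pending = true;
}

/* Hypothetical helper: models cancelling queued reset work. */
static void toy_stop_pending_resets(struct toy_dev *d)
{
	d->kfd_reset_pending = false;
}

/*
 * Models one recovery pass. kfd_req_mid_recovery simulates a KFD reset
 * request that arrives while this recovery is still running.
 */
static void toy_gpu_recover(struct toy_dev *d, bool full_reset,
			    bool kfd_req_mid_recovery)
{
	assert(!d->in_reset); /* recoveries are serialized in this model */
	d->in_reset = true;

	if (kfd_req_mid_recovery)
		toy_kfd_request_reset(d);

	if (full_reset) {
		d->full_resets++;
		/* Cancel pending requests only after a full reset was done. */
		toy_stop_pending_resets(d);
	}

	d->in_reset = false;
}

/* Models the reset domain servicing whatever is still queued. */
static void toy_run_pending(struct toy_dev *d)
{
	if (d->kfd_reset_pending) {
		d->kfd_reset_pending = false;
		toy_gpu_recover(d, true, false);
	}
}
```

In this model, a request raised during a full reset is cancelled, so only
one reset runs. When the full reset is skipped, the request stays queued
and is serviced afterwards, whereas a !amdgpu_in_reset() check at request
time would have dropped it.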

Thanks,
Lijo

>  }