RE: [PATCH] drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov

"Liu, Shaoyun" <Shaoyun.Liu@xxxxxxx> · Fri, 5 Nov 2021 17:59:28 +0000

[AMD Official Use Only]

Ye, a lot already been changed since then , now the  pre_reset and  post_reset not in the  lock/unlock anymore.  With  my previous change , we make kfd_pre_reset  avoid touch  HW . Now it's pure SW handling , should be safe  to be moved out of the full access . 
Anyway, thanks to bring this up, it will remind us to verify on the  XGMI configuration on SRIOV. 

Regards
shaoyun.liu 

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> 
Sent: Friday, November 5, 2021 1:48 PM
To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov

There was a reason why pre_reset was done differently on SRIOV. However, the code has changed a lot since then. Is this concern still valid?

> commit 7b184b006185215daf4e911f8de212964c99a514
> Author: wentalou <Wentao.Lou@xxxxxxx>
> Date:   Fri Dec 7 13:53:18 2018 +0800
>
>     drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang
>
>     XGMI hive put kfd_pre_reset into amdgpu_device_lock_adev,
>     but outside req_full_gpu of sriov.
>     It would make sriov hang during reset.
>
>     Signed-off-by: Wentao Lou <Wentao.Lou@xxxxxxx>
>     Reviewed-by: Shaoyun Liu <Shaoyun.Liu@xxxxxxx>
>     Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
Regards,
   Felix

On 2021-11-05 12:57 p.m., shaoyunl wrote:
> The KFD pre_reset should be called before reset been executed, it will 
> hold the lock to prevent other rocm process to sent the packlage to 
> hiq during host execute the real reset on the HW
>
> Signed-off-by: shaoyunl <shaoyun.liu@xxxxxxx>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +----
>   1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 95fec36e385e..d7c9dce17cad 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4278,8 +4278,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>   	if (r)
>   		return r;
>   
> -	amdgpu_amdkfd_pre_reset(adev);
> -
>   	/* Resume IP prior to SMC */
>   	r = amdgpu_device_ip_reinit_early_sriov(adev);
>   	if (r)
> @@ -5015,8 +5013,7 @@ int amdgpu_device_gpu_recover(struct 
> amdgpu_device *adev,
>   
>   		cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
>   
> -		if (!amdgpu_sriov_vf(tmp_adev))
> -			amdgpu_amdkfd_pre_reset(tmp_adev);
> +		amdgpu_amdkfd_pre_reset(tmp_adev);
>   
>   		/*
>   		 * Mark these ASICs to be reseted as untracked first