Re: [PATCH] drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov

Felix Kuehling <felix.kuehling@xxxxxxx> · Fri, 5 Nov 2021 17:10:50 -0400

On 2021-11-05 1:59 p.m., Liu, Shaoyun wrote:
[AMD Official Use Only]

Ye, a lot already been changed since then , now the  pre_reset and  post_reset not in the  lock/unlock anymore.  With  my previous change , we make kfd_pre_reset  avoid touch  HW . Now it's pure SW handling , should be safe  to be moved out of the full access .
Anyway, thanks to bring this up, it will remind us to verify on the  XGMI configuration on SRIOV.
OK. Assuming it doesn't break in your testing, consider the patch

Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>



Regards
shaoyun.liu

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Friday, November 5, 2021 1:48 PM
To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov

There was a reason why pre_reset was done differently on SRIOV. However, the code has changed a lot since then. Is this concern still valid?

commit 7b184b006185215daf4e911f8de212964c99a514
Author: wentalou <Wentao.Lou@xxxxxxx>
Date:   Fri Dec 7 13:53:18 2018 +0800

     drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang

     XGMI hive put kfd_pre_reset into amdgpu_device_lock_adev,
     but outside req_full_gpu of sriov.
     It would make sriov hang during reset.

     Signed-off-by: Wentao Lou <Wentao.Lou@xxxxxxx>
     Reviewed-by: Shaoyun Liu <Shaoyun.Liu@xxxxxxx>
     Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
Regards,
    Felix


On 2021-11-05 12:57 p.m., shaoyunl wrote:
The KFD pre_reset should be called before reset been executed, it will
hold the lock to prevent other rocm process to sent the packlage to
hiq during host execute the real reset on the HW

Signed-off-by: shaoyunl <shaoyun.liu@xxxxxxx>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +----
   1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 95fec36e385e..d7c9dce17cad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4278,8 +4278,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
   	if (r)
   		return r;
   
-	amdgpu_amdkfd_pre_reset(adev);
-
   	/* Resume IP prior to SMC */
   	r = amdgpu_device_ip_reinit_early_sriov(adev);
   	if (r)
@@ -5015,8 +5013,7 @@ int amdgpu_device_gpu_recover(struct
amdgpu_device *adev,
   
   		cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
   
-		if (!amdgpu_sriov_vf(tmp_adev))
-			amdgpu_amdkfd_pre_reset(tmp_adev);
+		amdgpu_amdkfd_pre_reset(tmp_adev);
   
   		/*
   		 * Mark these ASICs to be reseted as untracked first