On 2021-11-05 1:59 p.m., Liu, Shaoyun wrote:
[AMD Official Use Only]
Ye, a lot already been changed since then , now the pre_reset and post_reset not in the lock/unlock anymore. With my previous change , we make kfd_pre_reset avoid touch HW . Now it's pure SW handling , should be safe to be moved out of the full access .
Anyway, thanks to bring this up, it will remind us to verify on the XGMI configuration on SRIOV.
OK. Assuming it doesn't break in your testing, consider the patch
Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx>
Regards
shaoyun.liu
-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: Friday, November 5, 2021 1:48 PM
To: Liu, Shaoyun <Shaoyun.Liu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
There was a reason why pre_reset was done differently on SRIOV. However, the code has changed a lot since then. Is this concern still valid?
commit 7b184b006185215daf4e911f8de212964c99a514
Author: wentalou <Wentao.Lou@xxxxxxx>
Date: Fri Dec 7 13:53:18 2018 +0800
drm/amdgpu: kfd_pre_reset outside req_full_gpu cause sriov hang
XGMI hive put kfd_pre_reset into amdgpu_device_lock_adev,
but outside req_full_gpu of sriov.
It would make sriov hang during reset.
Signed-off-by: Wentao Lou <Wentao.Lou@xxxxxxx>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@xxxxxxx>
Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
Regards,
Felix
On 2021-11-05 12:57 p.m., shaoyunl wrote:
The KFD pre_reset should be called before reset been executed, it will
hold the lock to prevent other rocm process to sent the packlage to
hiq during host execute the real reset on the HW
Signed-off-by: shaoyunl <shaoyun.liu@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 95fec36e385e..d7c9dce17cad 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4278,8 +4278,6 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
if (r)
return r;
- amdgpu_amdkfd_pre_reset(adev);
-
/* Resume IP prior to SMC */
r = amdgpu_device_ip_reinit_early_sriov(adev);
if (r)
@@ -5015,8 +5013,7 @@ int amdgpu_device_gpu_recover(struct
amdgpu_device *adev,
cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
- if (!amdgpu_sriov_vf(tmp_adev))
- amdgpu_amdkfd_pre_reset(tmp_adev);
+ amdgpu_amdkfd_pre_reset(tmp_adev);
/*
* Mark these ASICs to be reseted as untracked first