RE: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

"Liu, Shaoyun" <Shaoyun.Liu@xxxxxxx> · Thu, 23 Dec 2021 18:07:51 +0000

[AMD Official Use Only]

I had a discussion with Andrey about this offline. It seems dangerous to remove in_gpu_reset and reset_sem directly inside flr_work. When the reset is triggered from the host side, the GPU needs to stay locked while the host performs the reset, that is, after flr_work has replied to the host with READY_TO_RESET.
The original comment seems to need updating.
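
As a rough illustration of the concern, here is a minimal user-space sketch (not driver code) of the window that the lock covers. All mock_* names are hypothetical, and the single atomic flag is only an assumed simplification of the write side of reset_sem:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the two fields of
 * struct amdgpu_device discussed above. The real reset_sem is an
 * rw_semaphore; one atomic flag models only its write side. */
struct mock_adev {
	atomic_bool reset_sem_held;
	atomic_int in_gpu_reset;
};

static void mock_adev_init(struct mock_adev *adev)
{
	atomic_init(&adev->reset_sem_held, false);
	atomic_init(&adev->in_gpu_reset, 0);
}

/* Models down_write_trylock(): succeeds only if nobody holds the lock. */
static bool mock_reset_trylock(struct mock_adev *adev)
{
	bool expected = false;

	return atomic_compare_exchange_strong(&adev->reset_sem_held,
					      &expected, true);
}

/* Models up_write(): the caller must currently hold the lock. */
static void mock_reset_unlock(struct mock_adev *adev)
{
	assert(atomic_load(&adev->reset_sem_held));
	atomic_store(&adev->reset_sem_held, false);
}

/* Start of flr_work: lock the GPU, then tell the host it may reset. */
static bool mock_flr_begin(struct mock_adev *adev)
{
	if (!mock_reset_trylock(adev))
		return false;
	atomic_store(&adev->in_gpu_reset, 1);
	/* The real code sends IDH_READY_TO_RESET to the host here. */
	return true;
}

/* flr_done: the host has finished (or the poll timed out), so unlock. */
static void mock_flr_end(struct mock_adev *adev)
{
	atomic_store(&adev->in_gpu_reset, 0);
	mock_reset_unlock(adev);
}

/* A guest-side recovery attempt: it may run only if it gets the lock. */
static bool mock_gpu_recover_try(struct mock_adev *adev)
{
	if (!mock_reset_trylock(adev))
		return false;
	/* A guest-side reset would run here. */
	mock_reset_unlock(adev);
	return true;
}
```

In this model, a call to mock_gpu_recover_try() made between mock_flr_begin() and mock_flr_end() is refused, which is the protection that has to survive in some form while the host performs its reset.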

Regards
Shaoyun.liu
 

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Andrey Grodzovsky
Sent: Wednesday, December 22, 2021 5:14 PM
To: dri-devel@xxxxxxxxxxxxxxxxxxxxx; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Liu, Monk <Monk.Liu@xxxxxxx>; Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>; Chen, Horace <Horace.Chen@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; daniel@xxxxxxxx
Subject: [RFC v2 8/8] drm/amd/virt: Drop concurrent GPU reset protection for SRIOV

Since the FLR work is now serialized against GPU resets, there is no need for this.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 11 -----------
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c | 11 -----------
 2 files changed, 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index 487cd654b69e..7d59a66e3988 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -248,15 +248,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 	int timeout = AI_MAILBOX_POLL_FLR_TIMEDOUT;
 
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-	 * otherwise the mailbox msg will be ruined/reseted by
-	 * the VF FLR.
-	 */
-	if (!down_write_trylock(&adev->reset_sem))
-		return;
-
 	amdgpu_virt_fini_data_exchange(adev);
-	atomic_set(&adev->in_gpu_reset, 1);
 
 	xgpu_ai_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
@@ -269,9 +261,6 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	} while (timeout > 1);
 
 flr_done:
-	atomic_set(&adev->in_gpu_reset, 0);
-	up_write(&adev->reset_sem);
-
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
 		&& (!amdgpu_device_has_job_running(adev) ||
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
index e3869067a31d..f82c066c8e8d 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c
@@ -277,15 +277,7 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 	struct amdgpu_device *adev = container_of(virt, struct amdgpu_device, virt);
 	int timeout = NV_MAILBOX_POLL_FLR_TIMEDOUT;
 
-	/* block amdgpu_gpu_recover till msg FLR COMPLETE received,
-	 * otherwise the mailbox msg will be ruined/reseted by
-	 * the VF FLR.
-	 */
-	if (!down_write_trylock(&adev->reset_sem))
-		return;
-
 	amdgpu_virt_fini_data_exchange(adev);
-	atomic_set(&adev->in_gpu_reset, 1);
 
 	xgpu_nv_mailbox_trans_msg(adev, IDH_READY_TO_RESET, 0, 0, 0);
 
@@ -298,9 +290,6 @@ static void xgpu_nv_mailbox_flr_work(struct work_struct *work)
 	} while (timeout > 1);
 
 flr_done:
-	atomic_set(&adev->in_gpu_reset, 0);
-	up_write(&adev->reset_sem);
-
 	/* Trigger recovery for world switch failure if no TDR */
 	if (amdgpu_device_should_recover_gpu(adev)
 		&& (!amdgpu_device_has_job_running(adev) ||
--
2.25.1