Re: [PATCH] amd/amdgpu: Reduce unnecessary repetitive GPU resets

Christian König <christian.koenig@xxxxxxx> · Fri, 20 Sep 2024 12:37:12 +0200

Am 20.09.24 um 09:36 schrieb YiPeng Chai:
In multiple GPUs case, after a GPU has started
resetting all GPUs on hive, other GPUs do not
need to trigger GPU reset again.

Please drop any such handling. GPU resets in a hive are serialized using 
a single thread workqueue.

If you want to prevent multiple GPU resets you just need to cancel other 
queued up resets before or after resetting the hive.

This handling here just duplicates this logic and is therefore a clear 
NAK from my side.

Regards,
Christian.


Signed-off-by: YiPeng Chai <YiPeng.Chai@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 21 ++++++++++++++++++++-
  1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index dbfc41ddc3c7..7d48541b03d8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -4306,8 +4306,27 @@ int amdgpu_ras_reset_gpu(struct amdgpu_device *adev)
  		ras->gpu_reset_flags |= AMDGPU_RAS_GPU_RESET_MODE1_RESET;
  	}
  
-	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0)
+	if (atomic_cmpxchg(&ras->in_recovery, 0, 1) == 0) {
+		struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
+		int hive_ras_recovery = 0;
+
+		if (hive) {
+			hive_ras_recovery = atomic_read(&hive->ras_recovery);
+			amdgpu_put_xgmi_hive(hive);
+		}
+		/* In the case of multiple GPUs, after a GPU has started
+		 * resetting all GPUs on hive, other GPUs do not need to
+		 * trigger GPU reset again.
+		 */
+		if (!hive_ras_recovery)
+			amdgpu_reset_domain_schedule(ras->adev->reset_domain, &ras->recovery_work);
+		else
+			atomic_set(&ras->in_recovery, 0);
+	} else {
+		flush_work(&ras->recovery_work);
  		amdgpu_reset_domain_schedule(ras->adev->reset_domain, &ras->recovery_work);
+	}
+
  	return 0;
  }