Re: [Patch V2] drm/amdgpu: Increase tlb flush timeout for sriov

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Wed, 10 Aug 2022 18:51:33 +0200

On 10.08.22 at 10:50, Dusica Milinkovic wrote:
[Why]
While running a benchmark (Luxmark) on multiple VFs, a KIQ timeout error was observed.
It happens because all of the VFs do the TLB invalidation at the same time.
Although each VF has the invalidate register set, on the hardware side
the invalidate requests are queued for execution.

[How]
In the case of 12 VFs, increase the timeout to 12*100ms

Signed-off-by: Dusica Milinkovic <Dusica.Milinkovic@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 +++++-
  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 +++++-
  2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
index 9ae8cdaa033e..5743975efea5 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
@@ -419,6 +419,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
  	uint32_t seq;
  	uint16_t queried_pasid;
  	bool ret;
+	uint32_t sriov_usec_timeout = 1200000;  /* wait for 12 * 100ms for SRIOV */

Please put that as a define into some header and never ever write
comments on the same line after a define.



  	struct amdgpu_ring *ring = &adev->gfx.kiq.ring;
  	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
  
@@ -437,7 +438,10 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
  
  		amdgpu_ring_commit(ring);
  		spin_unlock(&adev->gfx.kiq.ring_lock);
-		r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
+		if (amdgpu_sriov_vf(adev))
+			r = amdgpu_fence_wait_polling(ring, seq, sriov_usec_timeout);
+		else
+			r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);

Don't duplicate the whole call, just change the parameter.
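
A minimal sketch of what the two comments above could look like when combined, written as standalone C so it compiles outside the kernel. The macro name `SRIOV_USEC_TIMEOUT`, the `mock_device` struct, and the helpers `mock_sriov_vf()` and `kiq_flush_timeout()` are all hypothetical stand-ins for illustration; they are not taken from the amdgpu headers, and the final patch may well pick different names or a different header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name: wait up to 12 * 100 ms for the KIQ fence under SR-IOV. */
#define SRIOV_USEC_TIMEOUT 1200000

/* Stand-in for the two pieces of struct amdgpu_device used here. */
struct mock_device {
	bool is_sriov_vf;
	uint32_t usec_timeout;
};

/* Stand-in for amdgpu_sriov_vf(adev). */
static bool mock_sriov_vf(const struct mock_device *adev)
{
	return adev->is_sriov_vf;
}

/*
 * Select the timeout once, so the wait function is called in a single
 * place and only its timeout argument differs between the two cases.
 */
static uint32_t kiq_flush_timeout(const struct mock_device *adev)
{
	return mock_sriov_vf(adev) ? SRIOV_USEC_TIMEOUT : adev->usec_timeout;
}
```

In the driver itself the same idea would presumably reduce to a local `usec_timeout` variable chosen with `amdgpu_sriov_vf(adev)` and passed to the one existing `amdgpu_fence_wait_polling(ring, seq, ...)` call, with the comment placed on the line above the define rather than after it.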

Regards,
Christian.

  		if (r < 1) {
  			dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);
  			return -ETIME;
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index ab89d91975ab..bab26982b3f9 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -896,6 +896,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
  	uint32_t seq;
  	uint16_t queried_pasid;
  	bool ret;
+	uint32_t sriov_usec_timeout = 1200000;  /* wait for 12 * 100ms for SRIOV */
  	struct amdgpu_ring *ring = &adev->gfx.kiq.ring;
  	struct amdgpu_kiq *kiq = &adev->gfx.kiq;
  
@@ -935,7 +936,10 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct amdgpu_device *adev,
  
  		amdgpu_ring_commit(ring);
  		spin_unlock(&adev->gfx.kiq.ring_lock);
-		r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
+		if (amdgpu_sriov_vf(adev))
+			r = amdgpu_fence_wait_polling(ring, seq, sriov_usec_timeout);
+		else
+			r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
  		if (r < 1) {
  			dev_err(adev->dev, "wait for kiq fence error: %ld.\n", r);
  			up_read(&adev->reset_domain->sem);