Re: [PATCH 5/5] drm/amdgpu: reduce reset time

Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx> · Mon, 25 Jul 2022 17:17:43 -0400

On 2022-07-22 03:34, Victor Zhao wrote:
In multi container use case, reset time is important, so skip ring
tests and cp halt wait during ip suspending for reset as they are
going to fail and cost more time on reset


Why are they failing in this case ? Skipping ring tests is not the best 
idea as you
loose important indicator of system's sanity. Is there any way to make 
them work ?



Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c |  2 +-
  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c  | 26 +++++++++++++++----------
  2 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
index 222d3d7ea076..f872495ccc3a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c
@@ -477,7 +477,7 @@ int amdgpu_gfx_disable_kcq(struct amdgpu_device *adev)
  		kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.compute_ring[i],
  					   RESET_QUEUES, 0, 0);
  
-	if (adev->gfx.kiq.ring.sched.ready)
+	if (adev->gfx.kiq.ring.sched.ready && !amdgpu_in_reset(adev))
  		r = amdgpu_ring_test_helper(kiq_ring);
  	spin_unlock(&adev->gfx.kiq.ring_lock);
  
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
index fafbad3cf08d..9ae29023e38f 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c
@@ -5971,16 +5971,19 @@ static int gfx_v10_0_cp_gfx_enable(struct amdgpu_device *adev, bool enable)
  		WREG32_SOC15(GC, 0, mmCP_ME_CNTL, tmp);
  	}
  
-	for (i = 0; i < adev->usec_timeout; i++) {
-		if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
-			break;
-		udelay(1);
-	}
-
-	if (i >= adev->usec_timeout)
-		DRM_ERROR("failed to %s cp gfx\n", enable ? "unhalt" : "halt");
+	if (!amdgpu_in_reset(adev)) {
+		for (i = 0; i < adev->usec_timeout; i++) {
+			if (RREG32_SOC15(GC, 0, mmCP_STAT) == 0)
+				break;
+			udelay(1);
+		}
  
+		if (i >= adev->usec_timeout)
+			DRM_ERROR("failed to %s cp gfx\n",
+				  enable ? "unhalt" : "halt");
+	}
  	return 0;
+
  }


This change has impact beyond container case no ? We had no issue
with this code during regular reset cases so why we would give up on 
this code
which confirms CP is idle ? What is the side effect of skipping this 
during all GPU resets ?

Andrey


  
  static int gfx_v10_0_cp_gfx_load_pfp_microcode(struct amdgpu_device *adev)
@@ -7569,8 +7572,10 @@ static int gfx_v10_0_kiq_disable_kgq(struct amdgpu_device *adev)
  	for (i = 0; i < adev->gfx.num_gfx_rings; i++)
  		kiq->pmf->kiq_unmap_queues(kiq_ring, &adev->gfx.gfx_ring[i],
  					   PREEMPT_QUEUES, 0, 0);
-
-	return amdgpu_ring_test_helper(kiq_ring);
+	if (!amdgpu_in_reset(adev))
+		return amdgpu_ring_test_helper(kiq_ring);
+	else
+		return 0;
  }
  #endif
  
@@ -7610,6 +7615,7 @@ static int gfx_v10_0_hw_fini(void *handle)
  
  		return 0;
  	}
+
  	gfx_v10_0_cp_enable(adev, false);
  	gfx_v10_0_enable_gui_idle_interrupt(adev, false);