Re: [PATCH] drm/amdgpu: Fix two reset triggered in a row

Christian König <christian.koenig@xxxxxxx> · Tue, 23 Apr 2024 07:50:46 +0200

On 22.04.24 at 21:45, Yunxiang Li wrote:
The reset request from KFD is missing a check for whether a reset is
already in progress. This causes a second reset to be triggered right after
the previous one finishes. Add the check to align with the other reset sources.

NAK, that isn't how this should be handled.

Instead, all reset sources which were already handled by a previous reset
should be canceled.

In other words, there should be a cancel_work(&adev->kfd.reset_work);
somewhere in the KFD code. If this doesn't work correctly, then that call
is missing somewhere.

If you see amdgpu_in_reset() used outside of the low-level functions,
then that is clearly a bug.
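
To illustrate the difference, here is a rough user-space toy model of the idea, not amdgpu code. All sim_* names are made up for this sketch. sim_schedule() and sim_cancel() only mimic the return-value behaviour of queue_work() and cancel_work(); there is no real workqueue, locking or reset domain involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel work item such as kfd.reset_work. */
struct sim_work {
	bool pending;	/* queued but not yet started */
	int runs;	/* how many times the handler has run */
};

/* Queue the work; returns false if it was already pending. */
static bool sim_schedule(struct sim_work *w)
{
	assert(w != NULL);
	if (w->pending)
		return false;
	w->pending = true;
	return true;
}

/* Drop a queued item that has not started; returns true if it was pending. */
static bool sim_cancel(struct sim_work *w)
{
	bool was_pending;

	assert(w != NULL);
	was_pending = w->pending;
	w->pending = false;
	return was_pending;
}

/* Worker side: run the handler once if the item is still queued. */
static void sim_run_pending(struct sim_work *w)
{
	assert(w != NULL);
	if (w->pending) {
		w->pending = false;
		w->runs++;
	}
}

/* Returns the total number of resets performed in the scenario. */
static int sim_scenario(bool cancel_stale)
{
	struct sim_work kfd_reset_work = { .pending = false, .runs = 0 };
	int resets_done = 0;

	/* KFD requests a reset while another source is about to reset anyway. */
	sim_schedule(&kfd_reset_work);

	/* The other source's reset runs first and covers the KFD request too. */
	resets_done++;
	if (cancel_stale)
		sim_cancel(&kfd_reset_work);

	/* The worker later picks up whatever is still queued. */
	sim_run_pending(&kfd_reset_work);
	resets_done += kfd_reset_work.runs;

	return resets_done;
}
```

In this model sim_scenario(false) performs two resets in a row, while sim_scenario(true) performs only one, because the stale request is dropped by the reset that already handled it. No "is a reset in progress" check is needed at the point where the request is queued.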

Regards,
Christian.


Signed-off-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
index 3b4591f554f1..ce3dbb1cc2da 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -283,7 +283,7 @@ int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
  
  void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)
  {
-	if (amdgpu_device_should_recover_gpu(adev))
+	if (amdgpu_device_should_recover_gpu(adev) && !amdgpu_in_reset(adev))
  		amdgpu_reset_domain_schedule(adev->reset_domain,
  					     &adev->kfd.reset_work);
  }