Re: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

Christian König <christian.koenig@xxxxxxx> · Thu, 20 Aug 2020 10:59:58 +0200

Yes, that is perfectly valid. Same thing for multiple timeouts from 
different queues.

Christian.

On 20.08.20 at 10:40, Li, Dennis wrote:
[AMD Public Use]

Hi, Hawking,
       When a RAS uncorrectable error happens, the RAS interrupt triggers a GPU recovery. If a GFX or compute job times out at the same time, the driver triggers a second recovery while the first is still in progress.

Best Regards
Dennis Li
-----Original Message-----
From: Zhang, Hawking <Hawking.Zhang@xxxxxxx>
Sent: Thursday, August 20, 2020 4:24 PM
To: Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Cc: Li, Dennis <Dennis.Li@xxxxxxx>
Subject: RE: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

Hi Dennis,

Can you elaborate on the case where the driver re-enters GPU recovery in a single-GPU (sGPU) system? I'm wondering whether this is a valid case or whether we should prevent it from the beginning.

Regards,
Hawking

-----Original Message-----
From: Dennis Li <Dennis.Li@xxxxxxx>
Sent: Thursday, August 20, 2020 10:21
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>
Cc: Li, Dennis <Dennis.Li@xxxxxxx>
Subject: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery

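In a single-GPU system, if the driver re-enters GPU recovery, amdgpu_device_lock_adev() returns false, but hive is a null pointer in that case, so the mutex_unlock(&hive->hive_lock) on the bail-out path dereferences NULL. Jump to the common exit path instead, which checks hive before unlocking it.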

Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 82242e2f5658..81b1d9a1dca0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4371,8 +4371,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  		if (!amdgpu_device_lock_adev(tmp_adev)) {
  			DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
  				  job ? job->base.id : -1);
-			mutex_unlock(&hive->hive_lock);
-			return 0;
+			r = 0;
+			goto skip_recovery;
  		}
  
  		/*
@@ -4505,6 +4505,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
  		amdgpu_device_unlock_adev(tmp_adev);
  	}
  
+skip_recovery:
  	if (hive) {
  		atomic_set(&hive->in_reset, 0);
  		mutex_unlock(&hive->hive_lock);
--
2.17.1
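
The control flow the patch moves to can be modelled outside the kernel. The following is a minimal user-space sketch, not the driver code: `dev_model`, `hive_model`, `model_lock_dev` and `model_gpu_recover` are hypothetical names, and the compare-and-swap guard is only an assumed stand-in for whatever amdgpu_device_lock_adev() does internally. It shows a second entrant bailing out through the shared exit label, where the hive is checked for NULL before it is touched.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver structures; not the real amdgpu types. */
struct hive_model {
    int in_reset;   /* models hive->in_reset */
    int lock_held;  /* models hive->hive_lock being held */
};

struct dev_model {
    atomic_int in_gpu_reset;  /* nonzero while a recovery is running */
    struct hive_model *hive;  /* NULL on a single-GPU system */
};

/* Assumed guard: succeeds only for the first caller, fails on re-entry. */
static bool model_lock_dev(struct dev_model *dev)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&dev->in_gpu_reset, &expected, 1);
}

static void model_unlock_dev(struct dev_model *dev)
{
    atomic_store(&dev->in_gpu_reset, 0);
}

/*
 * Models the patched exit path: a re-entrant call sets r and jumps to
 * skip_recovery instead of unlocking the hive directly, so a NULL hive
 * is never dereferenced. *bailed reports whether the call was a re-entry.
 */
int model_gpu_recover(struct dev_model *dev, bool *bailed)
{
    struct hive_model *hive = dev->hive;
    int r = 0;

    *bailed = false;
    if (hive) {
        hive->lock_held = 1;
        hive->in_reset = 1;
    }

    if (!model_lock_dev(dev)) {
        /* Another recovery is already in progress: bail out. */
        *bailed = true;
        r = 0;
        goto skip_recovery;
    }

    /* ... the actual reset work would happen here ... */
    model_unlock_dev(dev);

skip_recovery:
    if (hive) {
        hive->in_reset = 0;
        hive->lock_held = 0;
    }
    return r;
}
```

Calling `model_gpu_recover()` on a `dev_model` whose `hive` is NULL and whose `in_gpu_reset` is already 1 corresponds to the single-GPU re-entry case from this thread: the call returns 0 with `*bailed` set and leaves the in-progress flag untouched.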
