[AMD Public Use] Hi, Hawking, When RAS uncorrectable error happens, RAS interrupt will trigger a GPU recovery. At the same time, if a GFX or compute job is timeout, driver will trigger a new one. Best Regards Dennis Li -----Original Message----- From: Zhang, Hawking <Hawking.Zhang@xxxxxxx> Sent: Thursday, August 20, 2020 4:24 PM To: Li, Dennis <Dennis.Li@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> Cc: Li, Dennis <Dennis.Li@xxxxxxx> Subject: RE: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery [AMD Public Use] Hi Dennis, Can you elaborate the case that driver re-enter GPU recovery in sGPU system? I'm wondering whether this is a valid case or we shall prevent this from the beginning. Regards, Hawking -----Original Message----- From: Dennis Li <Dennis.Li@xxxxxxx> Sent: Thursday, August 20, 2020 10:21 To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx> Cc: Li, Dennis <Dennis.Li@xxxxxxx> Subject: [PATCH] drm/amdgpu: fix the nullptr issue when reenter GPU recovery in single gpu system, if driver reenter gpu recovery, amdgpu_device_lock_adev will return false, but hive is nullptr now. Signed-off-by: Dennis Li <Dennis.Li@xxxxxxx> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c index 82242e2f5658..81b1d9a1dca0 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c @@ -4371,8 +4371,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, if (!amdgpu_device_lock_adev(tmp_adev)) { DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress", job ? job->base.id : -1); - mutex_unlock(&hive->hive_lock); - return 0; + r = 0; + goto skip_recovery; } /* @@ -4505,6 +4505,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev, amdgpu_device_unlock_adev(tmp_adev); } +skip_recovery: if (hive) { atomic_set(&hive->in_reset, 0); mutex_unlock(&hive->hive_lock); -- 2.17.1 _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx