On 4/23/2024 1:15 AM, Yunxiang Li wrote:
> Reset request from KFD is missing a check for if a reset is already in
> progress, this causes a second reset to be triggered right after the
> previous one finishes. Add the check to align with the other reset sources.
>
> Signed-off-by: Yunxiang Li <Yunxiang.Li@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> index 3b4591f554f1..ce3dbb1cc2da 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
> @@ -283,7 +283,7 @@ int amdgpu_amdkfd_post_reset(struct amdgpu_device *adev)
>
>  void amdgpu_amdkfd_gpu_reset(struct amdgpu_device *adev)
>  {
> -	if (amdgpu_device_should_recover_gpu(adev))
> +	if (amdgpu_device_should_recover_gpu(adev) && !amdgpu_in_reset(adev))
>  		amdgpu_reset_domain_schedule(adev->reset_domain,
>  					     &adev->kfd.reset_work);

Technically, we can't do this: there are cases where we skip the full
device reset, and amdgpu_in_reset() returns true even then.

The better thing to do is to move amdgpu_device_stop_pending_resets()
later in gpu_recover(): if a device has undergone a full reset, then
cancel all pending resets. Presently that happens earlier, which could be
why this issue is seen.

Thanks,
Lijo

>  }
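
To illustrate the ordering suggested above, here is a rough toy model in plain C. It is only a sketch and has not been tested against the driver. Every name in it (toy_dev, toy_request_reset, toy_recover) is a hypothetical stand-in, not an amdgpu symbol, and it assumes pending requests can be modelled as a simple counter. The point it tries to show: the "in reset" flag is set whether or not a full reset happens, so the pending requests are cancelled late, and only when a full reset actually ran.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for device state; NOT the real struct amdgpu_device. */
struct toy_dev {
	bool in_reset;         /* true for the whole recovery, full reset or not */
	int pending_resets;    /* reset requests queued but not yet serviced */
	int full_resets_done;  /* number of completed full device resets */
};

/* Any reset source queues a request, even while recovery is running. */
static void toy_request_reset(struct toy_dev *d)
{
	d->pending_resets++;
}

/*
 * Toy recovery path. full_reset == false models the case where the full
 * device reset is skipped; in_reset is still true while this runs.
 * requests_during models requests that arrive while recovery is in progress.
 */
static void toy_recover(struct toy_dev *d, bool full_reset, int requests_during)
{
	assert(!d->in_reset);  /* the toy model is not reentrant */
	d->in_reset = true;

	for (int i = 0; i < requests_during; i++)
		toy_request_reset(d);

	if (full_reset) {
		d->full_resets_done++;
		/*
		 * Cancel late, and only after a full reset: requests that
		 * arrived during recovery are already covered by it.
		 */
		d->pending_resets = 0;
	}
	/* If the full reset was skipped, queued requests stay pending. */

	d->in_reset = false;
}
```

In this model, a full reset that sees two requests arrive meanwhile leaves nothing pending, so no second reset follows it. A recovery that skips the full reset leaves its request queued, which is the case a plain in_reset check at scheduling time would drop.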