Problem:
During a hive reset caused by a command timing out on a ring, extra
resets are triggered by KFD, which is unable to access registers on
the resetting ASIC.

Fix:
Rework GPU reset to actively stop any pending reset works while
another reset is in progress.

v2:
Switch from the generic list used in v1 [1] to explicit stopping of
each reset request from each reset source, for each request submitter.

[1] - https://lore.kernel.org/all/20220504161841.24669-1-andrey.grodzovsky@xxxxxxx/

Andrey Grodzovsky (7):
  drm/amdgpu: Cache result of last reset at reset domain level.
  drm/amdgpu: Switch to delayed work from work_struct.
  drm/admgpu: Serialize RAS recovery work directly into reset domain queue.
  drm/amdgpu: Add delayed work for GPU reset from debugfs
  drm/amdgpu: Add delayed work for GPU reset from kfd.
  drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
  drm/amdgpu: Stop any pending reset if another in progress.

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 15 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 62 +++++++++++-----------
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 19 ++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  |  1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |  2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  6 +--
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  6 +--
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  6 +--
 14 files changed, 87 insertions(+), 54 deletions(-)

-- 
2.25.1