Presently, reset has multiple clients: RAS, job timeout, KFD hang
detection and the debug method. Instead of each client maintaining its
own work item, this series moves the reset work pool into the reset
domain. When a client makes a recovery request, the reset domain
allocates a work item and queues it for execution.

For job timeouts, each ring has its own TDR queue on which TDR work is
scheduled. From there, the work is queued to a reset domain chosen
according to the device configuration.

This makes it possible to have multiple reset domains. For example, when
the device is partitioned, each partition can maintain its own reset
domain, so a job timeout on one partition doesn't affect jobs on another
partition (provided the jobs have no interdependency). The reset logic
selects the appropriate reset domain based on the current device
configuration.

Lijo Lazar (5):
  drm/amdgpu: Add work pool to reset domain
  drm/amdgpu: Move to reset_schedule_work
  drm/amdgpu: Set flags to cancel all pending resets
  drm/amdgpu: Add API to queue and do reset work
  drm/amdgpu: Add TDR queue for ring

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c |  32 +++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h |   1 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  24 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  40 +++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |  16 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    |  71 ++++++------
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c  | 122 ++++++++++++++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  |  32 +++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   |   5 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |   1 -
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |  38 +++----
 drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c      |  44 ++++----
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |  33 +++---
 15 files changed, 285 insertions(+), 177 deletions(-)

--
2.25.1
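
For readers who want a concrete picture of the work pool idea described in the cover letter, here is a minimal userspace sketch. It is not the code from these patches: every name in it (`reset_domain`, `reset_work`, `reset_domain_queue`, `reset_domain_run`, `pick_reset_domain`, `RESET_POOL_SIZE`) is hypothetical, the pool size is an arbitrary assumption, and the kernel workqueue is replaced by a plain loop so that the example stands alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RESET_POOL_SIZE 4 /* hypothetical pool size, chosen for illustration */

/* The reset clients named in the cover letter. */
enum reset_source {
    RESET_SRC_RAS,
    RESET_SRC_JOB_TIMEOUT,
    RESET_SRC_KFD_HANG,
    RESET_SRC_DEBUG,
};

/* One slot of the pool; clients no longer carry their own work item. */
struct reset_work {
    bool in_use;
    enum reset_source source;
};

/* Hypothetical reset domain that owns the work pool. */
struct reset_domain {
    struct reset_work pool[RESET_POOL_SIZE];
    unsigned int resets_done;
};

/*
 * Pick the domain for a partition. With a single shared domain every
 * partition maps to index 0; with one domain per partition each
 * partition gets its own.
 */
static struct reset_domain *pick_reset_domain(struct reset_domain *domains,
                                              size_t n_domains,
                                              size_t partition)
{
    assert(domains != NULL && n_domains > 0);
    return &domains[partition < n_domains ? partition : 0];
}

/* Allocate a free work item from the domain's pool, or NULL if it is full. */
static struct reset_work *reset_domain_queue(struct reset_domain *domain,
                                             enum reset_source source)
{
    assert(domain != NULL);
    for (size_t i = 0; i < RESET_POOL_SIZE; i++) {
        if (!domain->pool[i].in_use) {
            domain->pool[i].in_use = true;
            domain->pool[i].source = source;
            return &domain->pool[i];
        }
    }
    return NULL;
}

/* Stand-in for the worker: handle every pending item and free its slot. */
static unsigned int reset_domain_run(struct reset_domain *domain)
{
    unsigned int handled = 0;

    assert(domain != NULL);
    for (size_t i = 0; i < RESET_POOL_SIZE; i++) {
        if (domain->pool[i].in_use) {
            domain->pool[i].in_use = false; /* slot returns to the pool */
            domain->resets_done++;
            handled++;
        }
    }
    return handled;
}
```

In this model, a job timeout queued on the domain picked for partition 0 leaves the domain for partition 1 untouched, which mirrors the isolation the cover letter describes for partitioned devices. A real driver would also need locking and asynchronous execution, both of which this sketch leaves out.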