Hi,

The goal of this patchset is to improve debugging of device resets on amdgpu.

The first patch creates a new module parameter to disable soft recoveries,
ensuring that every recovery goes through the full device reset. This makes it
easier to generate resets from userspace tools like [0] and [1], which is
important for validating how the stack behaves on resets, end to end.

The second patch is a small addition that marks jobs that cause soft
recoveries as guilty, for API consistency.

The remaining patches rework the coredump code to store more information in
devcoredump files, making them more useful to attach to bug reports.

The new coredump content looks like this:

    **** AMDGPU Device Coredump ****
    version: 1
    kernel: 6.4.0-rc7-tony+
    module: amdgpu
    time: 702.743534320
    process_name: vulkan-triangle PID: 4561
    IBs:
        [0] 0xffff800100545000
        [1] 0xffff800100001000
    ring name: gfx_0.0.0

Due to nested IBs, the ones listed here may not be what actually caused the
hang, but they give some direction.

Thanks,
	André

[0] https://gitlab.freedesktop.org/andrealmeid/gpu-timeout
[1] https://github.com/andrealmeid/vulkan-triangle-v1

André Almeida (6):
  drm/amdgpu: Create a module param to disable soft recovery
  drm/amdgpu: Mark contexts guilty for causing soft recoveries
  drm/amdgpu: Rework coredump to use memory dynamically
  drm/amdgpu: Limit info in coredump for kernel threads
  drm/amdgpu: Log IBs and ring name at coredump
  drm/amdgpu: Create version number for coredumps

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 21 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ctx.c    |  6 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 99 +++++++++++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  9 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   |  6 +-
 5 files changed, 112 insertions(+), 29 deletions(-)

-- 
2.41.0