Hi,

The goal of this patchset is to improve debugging of device resets on amdgpu.

The first patch creates a new module parameter to disable soft recoveries,
ensuring every recovery goes through the full device reset, which makes it
easier to generate resets from userspace tools like [0] and [1]. This is
important for validating how the stack behaves on resets, end to end.

The last patches rework the coredump code to store more information in
devcoredump files, making them more useful to attach to bug reports.

The new coredump content looks like this:

  **** AMDGPU Device Coredump ****
  version: 1
  kernel: 6.4.0-rc7-tony+
  module: amdgpu
  time: 702.743534320
  process_name: vulkan-triangle PID: 4561
  IBs:
          [0] 0xffff800100545000
          [1] 0xffff800100001000
  ring name: gfx_0.0.0

Due to nested IBs, the logged IBs may not be the ones that actually caused the
hang, but they give some direction.

Thanks,
	André

[0] https://gitlab.freedesktop.org/andrealmeid/gpu-timeout
[1] https://github.com/andrealmeid/vulkan-triangle-v1

Changelog:

v1: https://lore.kernel.org/dri-devel/20230711213501.526237-1-andrealmeid@xxxxxxxxxx/
- Drop "Mark contexts guilty for causing soft recoveries" patch
- Use GFP_NOWAIT for devcoredump allocation

André Almeida (6):
  drm/amdgpu: Create a module param to disable soft recovery
  drm/amdgpu: Allocate coredump memory in a nonblocking way
  drm/amdgpu: Rework coredump to use memory dynamically
  drm/amdgpu: Limit info in coredump for kernel threads
  drm/amdgpu: Log IBs and ring name at coredump
  drm/amdgpu: Create version number for coredumps

 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 21 +++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 99 +++++++++++++++++-----
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  9 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.c   |  6 +-
 4 files changed, 106 insertions(+), 29 deletions(-)

-- 
2.41.0