On 4/17/2024 1:14 PM, Khatri, Sunil wrote: > > On 4/17/2024 1:06 PM, Khatri, Sunil wrote: >> devcoredump is used to debug gpu hangs/resets. So in normal process >> when there is a hang due to ring timeout or page fault we are doing a >> hard reset as soft reset fail in those cases. How are we making sure >> that the devcoredump is triggered in those cases and captured? >> >> Regards >> Sunil Khatri >> >> On 4/17/2024 9:43 AM, Ahmad Rehman wrote: >>> In passthrough environment, the driver triggers the mode-1 reset on >>> reload. The reset causes the core dump collection which is delayed task >>> and prevents driver from unloading until it is completed. Since we do >>> not need to collect data on "reset on reload" case, we can skip core >>> dump collection. >>> >>> v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well. >>> >>> Signed-off-by: Ahmad Rehman <Ahmad.Rehman@xxxxxxx> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++-- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 1 + >>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h | 1 + >>> 3 files changed, 7 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 1b2e177bc2d6..c718982cffa8 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -5357,7 +5357,9 @@ int amdgpu_do_asic_reset(struct list_head >>> *device_list_handle, >>> /* Try reset handler method first */ >>> tmp_adev = list_first_entry(device_list_handle, struct >>> amdgpu_device, >>> reset_list); >>> - amdgpu_reset_reg_dumps(tmp_adev); >>> + >>> + if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags)) >>> + amdgpu_reset_reg_dumps(tmp_adev); >>> reset_context->reset_device_list = device_list_handle; >>> r = amdgpu_reset_perform_reset(tmp_adev, reset_context); >>> @@ -5430,7 +5432,8 @@ int amdgpu_do_asic_reset(struct list_head >>> *device_list_handle, >>> vram_lost = amdgpu_device_check_vram_lost(tmp_adev); >>> - amdgpu_coredump(tmp_adev, vram_lost, reset_context); >>> + if (!test_bit(AMDGPU_SKIP_COREDUMP, >>> &reset_context->flags)) >>> + amdgpu_coredump(tmp_adev, vram_lost, >>> reset_context); >>> if (vram_lost) { >>> DRM_INFO("VRAM is lost due to GPU reset!\n"); >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> index 6ea893ad9a36..c512f70b8272 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >>> @@ -2481,6 +2481,7 @@ static void >>> amdgpu_drv_delayed_reset_work_handler(struct work_struct *work) >>> /* Use a common context, just need to make sure full reset is >>> done */ >>> set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags); >>> + set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags); > If this is used for guests only can we better have a flag like > amdgpu_sriov_vf for setting the skip coredump flag ?? > A reset is not always triggered just because of hang. There are other cases like we want to do a reset after a suspend/resume cycle so that the device starts from a clean state. Those are intentionally triggered by driver. Also, there are case like RAS errors where we reset and that also really doesn't need a core dump. In all such cases, this flag is required, and this is one such case (this patch only addresses passthrough). Thanks, Lijo > Regards > Sunil khatri > >>> r = amdgpu_do_asic_reset(&device_list, &reset_context); >>> if (r) { >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h >>> index 66125d43cf21..b11d190ece53 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h >>> @@ -32,6 +32,7 @@ enum AMDGPU_RESET_FLAGS { >>> AMDGPU_NEED_FULL_RESET = 0, >>> AMDGPU_SKIP_HW_RESET = 1, >>> + AMDGPU_SKIP_COREDUMP = 2, >>> }; >>> struct amdgpu_reset_context {