Re: [PATCH v2] drm/amdgpu: Skip the coredump collection on reset during driver reload

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Wed, 17 Apr 2024 10:01:07 +0530

On 4/17/2024 9:43 AM, Ahmad Rehman wrote:
> In passthrough environment, the driver triggers the mode-1 reset on
> reload. The reset causes the core dump collection which is delayed task
> and prevents driver from unloading until it is completed. Since we do
> not need to collect data on "reset on reload" case, we can skip core
> dump collection.
> 
> v2: Use the same flag to avoid calling amdgpu_reset_reg_dumps as well.
> 
> Signed-off-by: Ahmad Rehman <Ahmad.Rehman@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 +++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    | 1 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h  | 1 +
>  3 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 1b2e177bc2d6..c718982cffa8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -5357,7 +5357,9 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>  	/* Try reset handler method first */
>  	tmp_adev = list_first_entry(device_list_handle, struct amdgpu_device,
>  				    reset_list);
> -	amdgpu_reset_reg_dumps(tmp_adev);
> +	
> +	if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
> +		amdgpu_reset_reg_dumps(tmp_adev);

This may be saved to some bool. Anyway,

Reviewed-by: Lijo Lazar <lijo.lazar@xxxxxxx>

Thanks,
Lijo

>  
>  	reset_context->reset_device_list = device_list_handle;
>  	r = amdgpu_reset_perform_reset(tmp_adev, reset_context);
> @@ -5430,7 +5432,8 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>  
>  				vram_lost = amdgpu_device_check_vram_lost(tmp_adev);
>  
> -				amdgpu_coredump(tmp_adev, vram_lost, reset_context);
> +				if (!test_bit(AMDGPU_SKIP_COREDUMP, &reset_context->flags))
> +					amdgpu_coredump(tmp_adev, vram_lost, reset_context);
>  
>  				if (vram_lost) {
>  					DRM_INFO("VRAM is lost due to GPU reset!\n");
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 6ea893ad9a36..c512f70b8272 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -2481,6 +2481,7 @@ static void amdgpu_drv_delayed_reset_work_handler(struct work_struct *work)
>  
>  	/* Use a common context, just need to make sure full reset is done */
>  	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
> +	set_bit(AMDGPU_SKIP_COREDUMP, &reset_context.flags);
>  	r = amdgpu_do_asic_reset(&device_list, &reset_context);
>  
>  	if (r) {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> index 66125d43cf21..b11d190ece53 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.h
> @@ -32,6 +32,7 @@ enum AMDGPU_RESET_FLAGS {
>  
>  	AMDGPU_NEED_FULL_RESET = 0,
>  	AMDGPU_SKIP_HW_RESET = 1,
> +	AMDGPU_SKIP_COREDUMP = 2,
>  };
>  
>  struct amdgpu_reset_context {