[AMD Official Use Only - General]

Hi Lijo,

IH resume was added to resolve an issue found during mode2 bring-up on
sienna cichlid:
- close down mode2 reset and do a mode1 reset first
- open mode2 reset and do a mode2 reset.

Mode2 reset was found to fail in this case. Resuming IH helps here.

Thanks,
Victor

-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Sent: Thursday, September 15, 2022 1:58 PM
To: Zhao, Victor <Victor.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Deng, Emily <Emily.Deng@xxxxxxx>; Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Subject: Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume

On 9/14/2022 3:40 PM, Victor Zhao wrote:
> [background]
> On the current sienna cichlid mode2 reset, in the slow job hang cases,
> since the page table context was reverted to completely stop the GPU,
> it will generate a page fault interrupt.
>
> Since the IRQs are enabled during the recovery stage, if such an
> interrupt is being processed during the IH resume step (which
> increments the IH ring rptr) while IH resume sets rptr and wptr
> to 0, this may cause

AFAIK, only GFX/SDMA are affected by mode-2. IH is not suspended before mode-2. Why do you resume IH after mode-2 when it is not suspended? Is it a special case for virtualization?

Thanks,
Lijo

> rptr to become greater than wptr. This case is not handled in IH
> processing, and it will cause rptr to keep increasing until it
> reaches the max.
> Such a case will make the fence fallback situation happen.
>
> [how]
> Move the enabling of the IRQs to after IH is resumed and before the
> IB test.
> Adjust the position of enabling the IRQs on the other reset paths
> accordingly.
>
> Signed-off-by: Victor Zhao <Victor.Zhao@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 8 ++++----
>  drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 +
>  2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c0cfae52f12b..0b658225e9ef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>  		amdgpu_fence_driver_force_completion(ring);
>  	}
>
> -	amdgpu_fence_driver_isr_toggle(adev, false);
> -
>  	if (job && job->vm)
>  		drm_sched_increase_karma(&job->base);
>
> @@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>  		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
>  	skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, &reset_context->flags);
>
> +	list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
> +		amdgpu_fence_driver_isr_toggle(tmp_adev, false);
> +	}
> +
>  	/*
>  	 * ASIC reset has to be done on all XGMI hive nodes ASAP
>  	 * to allow proper links negotiation in FW (within 1 sec)
> @@ -5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>  		/* Clear this failed job from fence array */
>  		amdgpu_fence_driver_clear_job_fences(ring);
>
> -		amdgpu_fence_driver_isr_toggle(adev, false);
> -
>  		/* Since the job won't signal and we go for
>  		 * another resubmit drop this parent pointer
>  		 */
> diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
> index 7aa570c1ce4a..953036482d1f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
> @@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl,
>  	 * Add this ASIC as tracked as reset was already
>  	 * complete successfully.
>  	 */
> +	amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>  	amdgpu_register_gpu_instance(tmp_adev);
>
>  	/* Resume RAS */
>
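[Editor's note] The rptr/wptr race in the quoted commit message can be illustrated with a minimal, hypothetical ring model. This is not the actual amdgpu IH code; the `ih_ring` struct, `ih_process` function, and `RING_SIZE` are invented for illustration. The point is the failure mode Victor describes: if IH resume zeroes wptr while a handler still holds a stale rptr, a naive consume loop walks forward until the ring index wraps, so the consumer must detect rptr > wptr and resynchronize instead.

```c
#include <stdint.h>

#define RING_SIZE 256 /* entries; arbitrary for this sketch */

/* Minimal model of an interrupt-handler ring (hypothetical, not amdgpu). */
struct ih_ring {
	uint32_t rptr; /* read pointer, advanced by the CPU handler */
	uint32_t wptr; /* write pointer, normally advanced by HW */
};

/*
 * Consume pending entries and return how many were handled.
 * If a reset zeroed wptr while rptr is stale (rptr > wptr), a naive
 * "while (rptr != wptr) rptr++" loop would walk all the way around
 * the ring. Guard against that by resynchronizing rptr to wptr.
 */
static int ih_process(struct ih_ring *ring)
{
	int handled = 0;

	if (ring->rptr > ring->wptr) {
		/* stale rptr after ring reset: resync, process nothing */
		ring->rptr = ring->wptr;
		return 0;
	}

	while (ring->rptr != ring->wptr) {
		ring->rptr = (ring->rptr + 1) % RING_SIZE;
		handled++;
	}
	return handled;
}
```

The patch sidesteps the window in which this guard would be needed by keeping the IRQs disabled until IH has been resumed, so no handler can observe the pointers mid-reset.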