RE: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[AMD Official Use Only - General]

Hi Luben

Please let us know if you don't have bandwidth to review and we can require other people to review, please be understand that we are in an urgent situation and need this change to go to mainline this week.

Thanks 
-------------------------------------------------------------------
Monk Liu | Cloud GPU & Virtualization Software | AMD
-------------------------------------------------------------------

-----Original Message-----
From: Wang, YuBiao <YuBiao.Wang@xxxxxxx> 
Sent: 2023年3月14日 14:34
To: Tuikov, Luben <Luben.Tuikov@xxxxxxx>
Cc: Quan, Evan <Evan.Quan@xxxxxxx>; Chen, Horace <Horace.Chen@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Xu, Feifei <Feifei.Xu@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: RE: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

Hi Luben,

I'd have to ping you because we've got a P1 ticket currently on this issue. Would you please give a vague time when would you confirm whether this patch is safe? Thank you a lot for helping double check this.

Regards & Thanks,
Yubiao 

-----Original Message-----
From: Tuikov, Luben <Luben.Tuikov@xxxxxxx>
Sent: Saturday, March 11, 2023 12:56 AM
To: Wang, YuBiao <YuBiao.Wang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Quan, Evan <Evan.Quan@xxxxxxx>; Chen, Horace <Horace.Chen@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Xu, Feifei <Feifei.Xu@xxxxxxx>; Wang, Yang(Kevin) <KevinYang.Wang@xxxxxxx>
Subject: Re: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a 
> failed ib test before mode 1 reset during asic reset. The fences in 
> this case are never signaled and next time when we try to free the 
> sa_bo, kernel will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the 
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be 
> released in job_free_cb, and for non-sched jobs like ib_test, it's 
> meant to be released in sa_bo_free but only when the fences are 
> signaled. So we have to force signal the non_sched bad job's fence 
> during pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang <YuBiao.Wang@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct
> amdgpu_ring *ring)  {
>  	int i;
>  	struct dma_fence *old, **ptr;
> +	struct amdgpu_job *job;
>  
>  	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>  		ptr = &ring->fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>  		if (old && old->ops == &amdgpu_job_fence_ops) {
>  			RCU_INIT_POINTER(*ptr, NULL);
>  			dma_fence_put(old);
> +			/* For non-sched bad job, i.e. failed ib test, we need to force
> +			 * signal it right here or we won't be able to track them in fence drv
> +			 * and they will remain unsignaled during sa_bo free.
> +			 */
> +			job = container_of(old, struct amdgpu_job, hw_fence);
> +			if (!job->base.s_fence && !dma_fence_is_signaled(old))
> +				dma_fence_signal(old);

Conceptually, I don't mind this patch for what it does. The only thing which worries me is this check here, !job->base.s_fence, which is used here to qualify that we can signal the fence (and of course that the fence is not yet signalled.) We need to audit this check to make sure that it is not overloaded to mean other things. I'll take a look.

>  		}
>  	}
>  }

--
Regards,
Luben




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux