RE: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

"Lou, Wentao" <Wentao.Lou@xxxxxxx> · Wed, 26 Dec 2018 06:38:15 +0000

Hi Andrey,

I can't find either 'drm/sched: Refactor ring mirror list handling' or 'drm/sched: Rework HW fence processing' in the merged list for amd-staging-dkms-4.18.
The new osdb still shows many call traces triggered in dma_fence_set_error. Do you have a link to these patches?
Thanks.

BR,
Wentao

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx> 
Sent: Saturday, December 22, 2018 12:57 AM
To: Lou, Wentao <Wentao.Lou@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: dma_fence finished signaled by unexpected callback

I believe this issue would be resolved by my patch set that is pending review, specifically 'drm/sched: Refactor ring mirror list handling.' With it, the first timeout (TO) handler already goes over all the rings, including the second timed-out ring, and removes all callbacks, including the bad job's callback. If the bad job has nevertheless signaled by that time for some reason, it will already have been removed from the mirror list during drm_sched_process_job (take a look at 'drm/sched: Rework HW fence processing.') and hence will not be rerun in drm_sched_job_recovery (drm_sched_resubmit_jobs under its new name).
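
To make the mirror list argument concrete, here is a minimal user-space sketch of the idea. It is not the scheduler code from the patch set: every `sketch_*` name and structure is hypothetical, and the mirror list is reduced to a singly linked list. It only shows that a job unlinked when its fence callback runs is no longer visible to the resubmission pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the scheduler structures. */
struct sketch_job {
    int id;
    bool signaled;            /* the job's HW fence has signaled */
    struct sketch_job *next;  /* link in the ring mirror list */
};

struct sketch_sched {
    struct sketch_job *mirror_list;  /* jobs pushed to HW, not yet finished */
};

/* Put a job on the mirror list when it is handed to the hardware. */
static void sketch_push_job(struct sketch_sched *s, struct sketch_job *job)
{
    assert(s != NULL && job != NULL);
    job->signaled = false;
    job->next = s->mirror_list;
    s->mirror_list = job;
}

/* Models the fence callback: mark the job signaled and unlink it. */
static void sketch_process_job(struct sketch_sched *s, struct sketch_job *job)
{
    struct sketch_job **pp;

    job->signaled = true;
    for (pp = &s->mirror_list; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == job) {
            *pp = job->next;
            job->next = NULL;
            break;
        }
    }
}

/* Models resubmission after a reset: counts the jobs that would be rerun. */
static int sketch_resubmit_jobs(const struct sketch_sched *s)
{
    const struct sketch_job *j;
    int rerun = 0;

    for (j = s->mirror_list; j != NULL; j = j->next)
        rerun++;
    return rerun;
}
```

In this model, a job that signals late is dropped from the list before the resubmission pass walks it, so it never reaches the run path a second time.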

Andrey

On 12/21/2018 03:25 AM, wentalou wrote:
> When two rings time out at the same time, job_timedout is triggered separately for each.
> Each job_timedout calls gpu_recover, but one gpu_recover is blocked on the mutex_lock held by the other.
> The bad job's callback should be removed by dma_fence_remove_callback, but that call sits behind the mutex_lock,
> so dma_fence_remove_callback cannot be called immediately.
> The callback drm_sched_process_job then fires unexpectedly and sets DMA_FENCE_FLAG_SIGNALED_BIT.
> After the other gpu_recover calls mutex_unlock, the signaled bad job goes through job_run inside drm_sched_job_recovery.
> job_run then hits a WARN_ON and call trace when calling kcl_dma_fence_set_error for the signaled bad job.
>
> Change-Id: I6366add13f020476882b2b8b03330a58d072dd1a
> Signed-off-by: Wentao Lou <Wentao.Lou@xxxxxxx>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 0a17fb1..fc1d3a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -225,8 +225,11 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>   
>   	trace_amdgpu_sched_run_job(job);
>   
> -	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter))
> +	if (job->vram_lost_counter != atomic_read(&ring->adev->vram_lost_counter)) {
> +		/* flags might be signaled by unexpected callback, clear it */
> +		test_and_clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &finished->flags);
>   		dma_fence_set_error(finished, -ECANCELED);/* skip IB as well if VRAM lost */
> +	}
>   
>   	if (finished->error < 0) {
>   		DRM_INFO("Skip scheduling IBs!\n");
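
The warning described in the commit message can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code: the `sketch_*` names are hypothetical, the `warnings` counter stands in for a WARN_ON splat, and `SKETCH_SIGNALED` stands in for DMA_FENCE_FLAG_SIGNALED_BIT. It models only the check that an error must be set before a fence is signaled, and the effect of clearing the flag first as the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_SIGNALED (1UL << 0)  /* stands in for DMA_FENCE_FLAG_SIGNALED_BIT */

/* Hypothetical, simplified fence: flags, error code, and a warning counter. */
struct sketch_fence {
    unsigned long flags;
    int error;
    int warnings;  /* counts what would be WARN_ON splats in the kernel */
};

/* Models signaling: returns false if the fence was already signaled. */
static bool sketch_fence_signal(struct sketch_fence *f)
{
    if (f->flags & SKETCH_SIGNALED)
        return false;
    f->flags |= SKETCH_SIGNALED;
    return true;
}

/* Models setting an error: warns if the fence is already signaled. */
static void sketch_fence_set_error(struct sketch_fence *f, int error)
{
    assert(error < 0);  /* only negative errno values are valid errors */
    if (f->flags & SKETCH_SIGNALED)
        f->warnings++;
    f->error = error;
}

/* Models the patch's test_and_clear_bit() on the signaled flag. */
static bool sketch_fence_test_and_clear_signaled(struct sketch_fence *f)
{
    bool was_set = (f->flags & SKETCH_SIGNALED) != 0;

    f->flags &= ~SKETCH_SIGNALED;
    return was_set;
}
```

In this model, calling `sketch_fence_set_error` on a signaled fence increments `warnings`, while clearing the flag first keeps it at zero. The sketch does not model callbacks or waiters that may already have run when the fence was signaled.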

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx