On 2023-03-08 21:27, YuBiao Wang wrote: > v2: Add comments to clarify in the code. > > [Why] > For engines not supporting soft reset, i.e. VCN, there will be a failed > ib test before mode 1 reset during asic reset. The fences in this case > are never signaled and next time when we try to free the sa_bo, kernel > will hang. > > [How] > During pre_asic_reset, driver will clear job fences and afterwards the > fences' refcount will be reduced to 1. For drm_sched_jobs it will be > released in job_free_cb, and for non-sched jobs like ib_test, it's meant > to be released in sa_bo_free but only when the fences are signaled. So > we have to force signal the non_sched bad job's fence during > pre_asic_reset or the clear is not complete. > > Signed-off-by: YuBiao Wang <YuBiao.Wang@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > index faff4a3f96e6..ad7c5b70c35a 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring) > { > int i; > struct dma_fence *old, **ptr; > + struct amdgpu_job *job; > > for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) { > ptr = &ring->fence_drv.fences[i]; > @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring) > if (old && old->ops == &amdgpu_job_fence_ops) { > RCU_INIT_POINTER(*ptr, NULL); > dma_fence_put(old); > + /* For non-sched bad job, i.e. failed ib test, we need to force > + * signal it right here or we won't be able to track them in fence drv > + * and they will remain unsignaled during sa_bo free. > + */ > + job = container_of(old, struct amdgpu_job, hw_fence); > + if (!job->base.s_fence && !dma_fence_is_signaled(old)) > + dma_fence_signal(old); Conceptually, I don't mind this patch for what it does. The only thing which worries me is this check here, !job->base.s_fence, which is used here to qualify that we can signal the fence (and of course that the fence is not yet signalled.) We need to audit this check to make sure that it is not overloaded to mean other things. I'll take a look. > } > } > } -- Regards, Luben