On 2023-03-08 21:27, YuBiao Wang wrote: > v2: Add comments to clarify in the code. > > [Why] > For engines not supporting soft reset, i.e. VCN, there will be a failed > ib test before mode 1 reset during asic reset. The fences in this case > are never signaled and next time when we try to free the sa_bo, kernel > will hang. > > [How] > During pre_asic_reset, driver will clear job fences and afterwards the > fences' refcount will be reduced to 1. For drm_sched_jobs it will be > released in job_free_cb, and for non-sched jobs like ib_test, it's meant > to be released in sa_bo_free but only when the fences are signaled. So > we have to force signal the non_sched bad job's fence during > pre_asic_reset or the clear is not complete. > > Signed-off-by: YuBiao Wang <YuBiao.Wang@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++ > 1 file changed, 8 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > index faff4a3f96e6..ad7c5b70c35a 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c > @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring) > { > int i; > struct dma_fence *old, **ptr; > + struct amdgpu_job *job; > > for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) { > ptr = &ring->fence_drv.fences[i]; > @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring) > if (old && old->ops == &amdgpu_job_fence_ops) { > RCU_INIT_POINTER(*ptr, NULL); > dma_fence_put(old); > + /* For non-sched bad job, i.e. failed ib test, we need to force > + * signal it right here or we won't be able to track them in fence drv > + * and they will remain unsignaled during sa_bo free. > + */ > + job = container_of(old, struct amdgpu_job, hw_fence); > + if (!job->base.s_fence && !dma_fence_is_signaled(old)) > + dma_fence_signal(old); Hi YuBiao, Thanks for adding the clarifying comments and sending a v2 of this patch. Perhaps move the chunk you're adding, to sit before, rather than after, the statements of the if-conditional. Also move the "job" variable declaration to be inside the if-conditional, since it is not used by anything outside it. Something like this, (note a few small fixes to the comment), if (old && old->ops == &amdgpu_job_fence_ops) { struct amdgpu_job *job; /* For non-scheduler bad job, i.e. failed IB test, we need to signal * it right here or we won't be able to track them in fence_drv * and they will remain unsignaled during sa_bo free. */ job = container_of(old, struct amdgpu_job, hw_fence); if (!job->base.s_fence && !dma_fence_is_signaled(old)) dma_fence_signal(old); RCU_INIT_POINTER(*ptr, NULL); dma_fence_put(old); } Then, give it a test. With that change, and upon successful test results, this patch is, Acked-by: Luben Tuikov <luben.tuikov@xxxxxxx> -- Regards, Luben