On 2021/10/23 上午4:41, Andrey Grodzovsky wrote: > > What do you mean by underflow in this case ? You mean use after free because of extra dma_fence_put() ? yes > > On 2021-10-22 4:14 a.m., JingWen Chen wrote: >> ping >> >> On 2021/10/22 AM11:33, Jingwen Chen wrote: >>> [Why] >>> In advance tdr mode, the real bad job will be resubmitted twice, while >>> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job >>> is put one more time than other jobs. >>> >>> [How] >>> Adding dma_fence_get before resbumit job in >>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs >>> >>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@xxxxxxx> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++ >>> 1 file changed, 4 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> index 41ce86244144..975f069f6fe8 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs( >>> /* clear job's guilty and depend the folowing step to decide the real one */ >>> drm_sched_reset_karma(s_job); >>> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get >>> + * to make sure fence is balanced */ > > > But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence. > fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one. > > Andrey > > Hi Andrey, If I remember correctly, after we embedded the hw_fence into amdgpu_job, there will be not fence replacement in amdgpu_job_run. >>> + dma_fence_get(s_job->s_fence->parent); >>> drm_sched_resubmit_jobs_ext(&ring->sched, 1); >>> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout); >>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs( >>> /* got the hw fence, signal finished fence */ >>> atomic_dec(ring->sched.score); >>> + dma_fence_put(s_job->s_fence->parent); >>> dma_fence_get(&s_job->s_fence->finished); >>> dma_fence_signal(&s_job->s_fence->finished); >>> dma_fence_put(&s_job->s_fence->finished);