ping

On 2021/10/22 AM11:33, Jingwen Chen wrote:
> [Why]
> In advance tdr mode, the real bad job will be resubmitted twice, while
> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
> is put one more time than other jobs.
>
> [How]
> Adding dma_fence_get before resubmit job in
> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>
> Signed-off-by: Jingwen Chen <Jingwen.Chen2@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 41ce86244144..975f069f6fe8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>
>  		/* clear job's guilty and depend the folowing step to decide the real one */
>  		drm_sched_reset_karma(s_job);
> +		/* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
> +		 * to make sure fence is balanced */
> +		dma_fence_get(s_job->s_fence->parent);
>  		drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>
>  		ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>
>  		/* got the hw fence, signal finished fence */
>  		atomic_dec(ring->sched.score);
> +		dma_fence_put(s_job->s_fence->parent);
>  		dma_fence_get(&s_job->s_fence->finished);
>  		dma_fence_signal(&s_job->s_fence->finished);
>  		dma_fence_put(&s_job->s_fence->finished);
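
For anyone reviewing the refcount reasoning in [Why]/[How], here is a rough user-space sketch of the accounting. It is not kernel code and does not model the real scheduler: toy_fence, toy_resubmit and simulate are hypothetical names, the starting count of 2 is arbitrary, and the only behaviour modelled is the one the commit message states, namely that each resubmission drops one reference on the parent fence.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a fence: only the reference count is modelled. */
struct toy_fence {
	int refcount;
};

static void toy_fence_get(struct toy_fence *f)
{
	f->refcount++;
}

static void toy_fence_put(struct toy_fence *f)
{
	assert(f->refcount > 0);	/* catch an unbalanced put */
	f->refcount--;
}

/*
 * Assumption taken from the commit message: resubmitting a job
 * drops one reference on its parent fence.
 */
static void toy_resubmit(struct toy_fence *parent)
{
	toy_fence_put(parent);
}

/*
 * Returns the parent fence's final refcount for one job.
 * is_bad:  the job is the real guilty one, so it is resubmitted twice.
 * patched: apply the extra get (and the put for normal jobs) from the patch.
 */
static int simulate(bool is_bad, bool patched)
{
	struct toy_fence parent = { .refcount = 2 };	/* arbitrary start */

	if (patched)
		toy_fence_get(&parent);		/* added before the resubmit */

	toy_resubmit(&parent);			/* resubmit during the recheck */

	if (is_bad)
		toy_resubmit(&parent);		/* second resubmit of the bad job */
	else if (patched)
		toy_fence_put(&parent);		/* added on the "got the hw fence" path */

	return parent.refcount;
}
```

In this model, without the patch the bad job ends one reference lower than a normal job (simulate(true, false) versus simulate(false, false)); with the patch both paths end at the same count, which is the balance the patch is aiming for.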