On 2021/10/28 上午3:43, Andrey Grodzovsky wrote: > > On 2021-10-25 10:57 p.m., JingWen Chen wrote: >> On 2021/10/25 下午11:18, Andrey Grodzovsky wrote: >>> On 2021-10-24 10:56 p.m., JingWen Chen wrote: >>>> On 2021/10/23 上午4:41, Andrey Grodzovsky wrote: >>>>> What do you mean by underflow in this case ? You mean use after free because of extra dma_fence_put() ? >>>> yes >>> >>> Then maybe update the description because 'underflow' is very confusing >>> >> will do >>>>> On 2021-10-22 4:14 a.m., JingWen Chen wrote: >>>>>> ping >>>>>> >>>>>> On 2021/10/22 AM11:33, Jingwen Chen wrote: >>>>>>> [Why] >>>>>>> In advance tdr mode, the real bad job will be resubmitted twice, while >>>>>>> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job >>>>>>> is put one more time than other jobs. >>>>>>> >>>>>>> [How] >>>>>>> Adding dma_fence_get before resbumit job in >>>>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs >>>>>>> >>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@xxxxxxx> >>>>>>> --- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++ >>>>>>> 1 file changed, 4 insertions(+) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> index 41ce86244144..975f069f6fe8 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c >>>>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs( >>>>>>> /* clear job's guilty and depend the folowing step to decide the real one */ >>>>>>> drm_sched_reset_karma(s_job); >>>>>>> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get >>>>>>> + * to make sure fence is balanced */ >>>>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence. >>>>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one. >>>>> >>>>> Andrey >>>>> >>>>> >>>> Hi Andrey, >>>> >>>> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there will be not fence replacement in amdgpu_job_run. >>> >>> Right, I forgot that... What about removing line https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ? >>> What if you make dma_get_fence unconditional instead ? >>> >>> Andrey >>> >>> >> Hi Andrey, >> >> I have tried this and this will cause normal jobs cannot be free(lacks a dma_fence_put). > > > I can't see it - can you point me where in that case you get unbalanced refcount ? As far as I see for a a normal job > being ran in amdgpu_device_recheck_guilty_jobs the refcount on hw_fence is - > > drm_sched_resubmit_jobs_ext->dma_fence_put -> refcount decrease by 1 > drm_sched_resubmit_jobs_ext->amdgpu_job_run->dma_fence_get increase by 1 > > In total refcount didn't change until now > > Next, dma_fence_wait_timeout completed successfully because the job is normal and then you delete that job from pending list and call the > free_job cb which drops remaining refcounts on the hw_fence. > > I am probably missing some dma_fence_get since you checked it on a device but I wonder where is my mistake ? > > Andrey > > Hi Andrey, The thing is the put/get is balanced right now for normal jobs in TDR. Changing this dma_fence_get to unconditional simply adds 1 dma_fence_get but there's no corresponding dma_fence_put for normal jobs. And if this can be helpful, I try to find all dma_fence_get/put for a normal job in advance TDR based on the latest drm-next. amdgpu_fence_emit -> dma_fence_init ref_count = 1 amdgpu_fence_emit -> add into rcu ref_count = 2 amdgpu_job_run->get after ib_schedule ref_count = 3 drm_sched_main-> add fence callback get ref_count = 4 drm_sched_main-> add fence callback put ref_count = 3 drm_sched_resubmit_jobs_ext ref_count = 2 amdgpu_fence_emit -> add into rcu ref_count = 3 amdgpu_fence_process-> put after signal ref_count = 2 drm_sched_main->free_job ref_count = 1 drm_sched_fence_release_finished ref_count = 0 If we do unconditional get, this sequence will turn into: amdgpu_fence_emit -> dma_fence_init ref_count = 1 amdgpu_fence_emit -> add into rcu ref_count = 2 amdgpu_job_run->get after ib_schedule ref_count = 3 drm_sched_main-> add fence callback get ref_count = 4 drm_sched_main-> add fence callback put ref_count = 3 drm_sched_resubmit_jobs_ext ref_count = 2 amdgpu_fence_emit -> add into rcu ref_count = 3 + amdgpu_job_run->get after ib_schedule ref_count = 4 amdgpu_fence_process-> put after signal ref_count = 3 drm_sched_main->free_job ref_count = 2 drm_sched_fence_release_finished ref_count = 1 > >> I have figured out all the get/put >> >> for sched_jobs and only the bad job lacks a dma_fence_get, other jobs are just fine. >> >>>>>>> + dma_fence_get(s_job->s_fence->parent); >>>>>>> drm_sched_resubmit_jobs_ext(&ring->sched, 1); >>>>>>> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout); >>>>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs( >>>>>>> /* got the hw fence, signal finished fence */ >>>>>>> atomic_dec(ring->sched.score); >>>>>>> + dma_fence_put(s_job->s_fence->parent); >>>>>>> dma_fence_get(&s_job->s_fence->finished); >>>>>>> dma_fence_signal(&s_job->s_fence->finished); >>>>>>> dma_fence_put(&s_job->s_fence->finished);