RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

"Zhu, Jiadong" <Jiadong.Zhu@xxxxxxx> · Mon, 18 Jul 2022 04:53:24 +0000

[AMD Official Use Only - General]

Hi Andrey,

Thank you for your clarifying.
The refcount modified by amdgpu_fence_emit on my side is different.
I update my code and get the patch 'drm/amdgpu: Follow up change to previous drm scheduler change.' , the " underflow " disappears.

My patch is not needed.

Thanks,
Jiadong

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Sent: Friday, July 15, 2022 11:43 PM
To: Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; Christian König <ckoenig.leichtzumerken@xxxxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Cc: Huang, Ray <Ray.Huang@xxxxxxx>; Liu, Aaron <Aaron.Liu@xxxxxxx>
Subject: Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails

On 2022-07-15 05:28, Zhu, Jiadong wrote:
> [AMD Official Use Only - General]
>
> Updated some comments
>
> -----Original Message-----
> From: Zhu, Jiadong
> Sent: Friday, July 15, 2022 5:13 PM
> To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>;
> amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Grodzovsky, Andrey
> <Andrey.Grodzovsky@xxxxxxx>
> Cc: Huang, Ray <Ray.Huang@xxxxxxx>; Liu, Aaron <Aaron.Liu@xxxxxxx>
> Subject: RE: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails
>
> Hi Christian,
>
> The resubmitted job in function amdgpu_ib_preempt_job_recovery returns the same hw fence because of this commit:
>
> static void amdgpu_ib_preempt_job_recovery(struct drm_gpu_scheduler *sched) {
>          struct drm_sched_job *s_job;
>          struct dma_fence *fence;
>
>          spin_lock(&sched->job_list_lock);
>          list_for_each_entry(s_job, &sched->pending_list, list) {
>                  fence = sched->ops->run_job(s_job);       //fence returned has the same address with swapped fences
>                  dma_fence_put(fence);
>          }
>          spin_unlock(&sched->job_list_lock);
> }
>
>
>
> commit c530b02f39850a639b72d01ebbf7e5d745c60831
> Author: Jack Zhang <Jack.Zhang1@xxxxxxx>
> Date:   Wed May 12 15:06:35 2021 +0800
>
>      drm/amd/amdgpu embed hw_fence into amdgpu_job
>
>      Why: Previously hw fence is alloced separately with job.
>      It caused historical lifetime issues and corner cases.
>      The ideal situation is to take fence to manage both job
>      and fence's lifetime, and simplify the design of gpu-scheduler.
>
>      How:
>      We propose to embed hw_fence into amdgpu_job.
>      1. We cover the normal job submission by this method.
>      2. For ib_test, and submit without a parent job keep the
>      legacy way to create a hw fence separately.
>      v2:
>      use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
>      embedded in a job.
>      v3:
>      remove redundant variable ring in amdgpu_job
>      v4:
>      add tdr sequence support for this feature. Add a job_run_counter to
>      indicate whether this job is a resubmit job.
>      v5
>      add missing handling in amdgpu_fence_enable_signaling
>
>      Signed-off-by: Jingwen Chen <Jingwen.Chen2@xxxxxxx>
>      Signed-off-by: Jack Zhang <Jack.Zhang7@xxxxxxxxxxx>
>      Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@xxxxxxx>
>      Reviewed by: Monk Liu <monk.liu@xxxxxxx>
>      Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
>
>
> Thus the fence we swapped out is signaled and put twice in the following 2 functions and we get " refcount_t: underflow; use-after-free. " errors latter.
>
>                  /* wait for jobs finished */
>                  amdgpu_fence_wait_empty(ring); //wait on the resubmitted fence which is signaled and put somewhere else. The refcount decreased by 1 after amdgpu_fence_wait_empty.
>
>                  /* signal the old fences */
>                  amdgpu_ib_preempt_signal_fences(fences, length);   //signal and put the previous swapped fence, signal would return -22.
>
> Thanks,
> Jiadong

Did you have 'drm/amdgpu: Follow up change to previous drm scheduler change.' this commit in your branch while you encountered this problem ?
I don't see an underflow issue
for the preempted job when inspecting the code with this commit in mind -

amdgpu_fence_emit
             dma_fence_init 1
             dma_fence_get(fence) 2
             rcu_assign_pointer(*ptr, dma_fence_get(fence) 3

drm_sched_main
     s_fence->parent = dma_fence_get(fence); 4
     dma_fence_put(fence); 3

amdgpu_ib_preempt_job_recovery
     amdgpu_fence_emit
         if (job && job->job_run_counter) -> dma_fence_get(fence); 4
         rcu_assign_pointer(*ptr, dma_fence_get(fence)); 5

     dma_fence_put(fence); 4

amdgpu_fence_wait_empty
     dma_fence_get_rcu(fence) 5
     dma_fence_put(fence) 4

amdgpu_process_fence (EOP interrupt for re-submission of preempted job)
     dma_fence_put 3

amdgpu_ib_preempt_signal_fences
     dma_fence_put 2

amdgpu_job_free_cb
     dma_fence_put(&job->hw_fence) 1

drm_sched_fence_release_scheduled
     dma_fence_put(fence->parent); 0

Also take a look here for reference -
https://drive.google.com/file/d/1yEoeW6OQC9WnwmzFW6NBLhFP_jD0xcHm/view

Andrey

Andrey

>
>
> -----Original Message-----
> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
> Sent: Friday, July 15, 2022 4:48 PM
> To: Zhu, Jiadong <Jiadong.Zhu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx;
> Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
> Cc: Huang, Ray <Ray.Huang@xxxxxxx>; Liu, Aaron <Aaron.Liu@xxxxxxx>
> Subject: Re: [PATCH 3/3] drm/amdgpu: skip put fence if signal fails
>
> [CAUTION: External Email]
>
> Am 15.07.22 um 10:43 schrieb jiadong.zhu@xxxxxxx:
>> From: "Jiadong.Zhu" <Jiadong.Zhu@xxxxxxx>
>>
>> Dma_fence_signal returning non-zero indicates that the fence is
>> signaled and put somewhere else.
>> Skip dma_fence_put to make the fence refcount correct.
> Well quite a big NAK on this.
>
> Reference counting should be completely independent where a fence signals.
>
> Andrey can you take a look at this as well?
>
> Thanks,
> Christian.
>
>> Signed-off-by: Jiadong.Zhu <Jiadong.Zhu@xxxxxxx>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index f4ed0785d523..93c1a5e83835 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -1500,8 +1500,8 @@ static void amdgpu_ib_preempt_signal_fences(struct dma_fence **fences,
>>                fence = fences[i];
>>                if (!fence)
>>                        continue;
>> -             dma_fence_signal(fence);
>> -             dma_fence_put(fence);
>> +             if (!dma_fence_signal(fence))
>> +                     dma_fence_put(fence);
>>        }
>>    }
>>