Re: [PATCH v3] drm/amdgpu: fix the memleak caused by fence not released

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 27.02.25 um 16:08 schrieb Yadav, Arvind:
>
> On 2/27/2025 7:55 PM, Christian König wrote:
>>
>> Am 18.02.25 um 15:53 schrieb Arvind Yadav:
>>> Encountering a taint issue during the unloading of gpu_sched
>>> due to the fence not being released/put. In this context,
>>> amdgpu_vm_clear_freed is responsible for creating a job to
>>> update the page table (PT). It allocates kmem_cache for
>>> drm_sched_fence and returns the finished fence associated
>>> with job->base.s_fence. In case of Usermode queue this finished
>>> fence is added to the timeline sync object through
>>> amdgpu_gem_update_bo_mapping, which is utilized by user
>>> space to ensure the completion of the PT update.
>>>
>>> [  508.900587] =============================================================================
>>> [  508.900605] BUG drm_sched_fence (Tainted: G                 N): Objects remaining in drm_sched_fence on __kmem_cache_shutdown()
>>> [  508.900617] -----------------------------------------------------------------------------
>>>
>>> [  508.900627] Slab 0xffffe0cc04548780 objects=32 used=2 fp=0xffff8ea81521f000 flags=0x17ffffc0000240(workingset|head|node=0|zone=2|lastcpupid=0x1fffff)
>>> [  508.900645] CPU: 3 UID: 0 PID: 2337 Comm: rmmod Tainted: G                 N 6.12.0+ #1
>>> [  508.900651] Tainted: [N]=TEST
>>> [  508.900653] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F34 06/10/2021
>>> [  508.900656] Call Trace:
>>> [  508.900659]  <TASK>
>>> [  508.900665]  dump_stack_lvl+0x70/0x90
>>> [  508.900674]  dump_stack+0x14/0x20
>>> [  508.900678]  slab_err+0xcb/0x110
>>> [  508.900687]  ? srso_return_thunk+0x5/0x5f
>>> [  508.900692]  ? try_to_grab_pending+0xd3/0x1d0
>>> [  508.900697]  ? srso_return_thunk+0x5/0x5f
>>> [  508.900701]  ? mutex_lock+0x17/0x50
>>> [  508.900708]  __kmem_cache_shutdown+0x144/0x2d0
>>> [  508.900713]  ? flush_rcu_work+0x50/0x60
>>> [  508.900719]  kmem_cache_destroy+0x46/0x1f0
>>> [  508.900728]  drm_sched_fence_slab_fini+0x19/0x970 [gpu_sched]
>>> [  508.900736]  __do_sys_delete_module.constprop.0+0x184/0x320
>>> [  508.900744]  ? srso_return_thunk+0x5/0x5f
>>> [  508.900747]  ? debug_smp_processor_id+0x1b/0x30
>>> [  508.900754]  __x64_sys_delete_module+0x16/0x20
>>> [  508.900758]  x64_sys_call+0xdf/0x20d0
>>> [  508.900763]  do_syscall_64+0x51/0x120
>>> [  508.900769]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>
>>> v2: call dma_fence_put in amdgpu_gem_va_update_vm
>>> v3: Addressed review comments from Christian.
>>>      - calling amdgpu_gem_update_timeline_node before switch.
>>>      - puting a dma_fence in case of error or !timeline_syncobj.
>>>
>>> Cc: Alex Deucher <alexander.deucher@xxxxxxx>
>>> Cc: Christian König <christian.koenig@xxxxxxx>
>>> Cc: Shashank Sharma <shashank.sharma@xxxxxxx>
>>> Cc: Sunil Khatri <sunil.khatri@xxxxxxx>
>>> Signed-off-by: Le Ma <le.ma@xxxxxxx>
>>> Signed-off-by: Arvind Yadav <arvind.yadav@xxxxxxx>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 33 +++++++++++++------------
>>>   1 file changed, 17 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> index 8b67aae6c2fe..40522b4f331b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> @@ -125,9 +125,6 @@ amdgpu_gem_update_bo_mapping(struct drm_file *filp,
>>>       struct amdgpu_vm *vm = &fpriv->vm;
>>>       struct dma_fence *last_update;
>>>   -    if (!syncobj)
>>> -        return;
>>> -
>>>       /* Find the last update fence */
>>>       switch (operation) {
>>>       case AMDGPU_VA_OP_MAP:
>>> @@ -839,6 +836,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>       struct drm_exec exec;
>>>       uint64_t va_flags;
>>>       uint64_t vm_size;
>>> +    int syncobj_status;
>>>       int r = 0;
>>>         if (args->va_address < AMDGPU_VA_RESERVED_BOTTOM) {
>>> @@ -931,6 +929,12 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>           bo_va = NULL;
>>>       }
>>>   +    syncobj_status = amdgpu_gem_update_timeline_node(filp,
>>> +                    args->vm_timeline_syncobj_out,
>>> +                    args->vm_timeline_point,
>>> +                    &timeline_syncobj,
>>> +                    &timeline_chain);
>>> +
>> You don't need syncobj_status here, just assign the return value to r and abort if we can't find any syncobj.
>
> I have not used variable 'r' because below inside switch(args->operation) the 'r' value will be reinitialized and the return value of amdgpu_gem_update_timeline_node will not be used. Here, we cannot abort because syncobj will be NULL for Non-UQ.

No, no that's wrong.

That timeline_syncobj is NULL is not an error. In other words when args->vm_timeline_syncobj_out == 0 then amdgpu_gem_update_timeline_node() should just set timeline_syncobj=NULL and return 0.

The error happens only if either args->vm_timeline_syncobj_out has a handler we can't find or if we fail to allocate memory for the timeline_chain.

In this case the return value should be EINVAl or ENOMEM and then we absolutely should abort the operation.

Regards,
Christian.

> we can use r but it will not do anything. :)
>
>
>>
>>>       switch (args->operation) {
>>>       case AMDGPU_VA_OP_MAP:
>>>           va_flags = amdgpu_gem_va_map_flags(adev, args->flags);
>>> @@ -957,22 +961,19 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>           break;
>>>       }
>>>       if (!r && !(args->flags & AMDGPU_VM_DELAY_UPDATE) && !adev->debug_vm) {
>>> -
>>> -        r = amdgpu_gem_update_timeline_node(filp,
>>> -                            args->vm_timeline_syncobj_out,
>>> -                            args->vm_timeline_point,
>>> -                            &timeline_syncobj,
>>> -                            &timeline_chain);
>>> -
>>>           fence = amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>                           args->operation);
>>>   -        if (!r)
>>> -            amdgpu_gem_update_bo_mapping(filp, bo_va,
>>> -                             args->operation,
>>> -                             args->vm_timeline_point,
>>> -                             fence, timeline_syncobj,
>>> -                             timeline_chain);
>>> +        if (syncobj_status || !timeline_syncobj) {
>>> +            dma_fence_put(fence);
>>> +            goto error;
>>> +        }
>> That should probably be something like this:
>>
>> if (timeline_syncobj)
>>      amdgpu_gem_update_bo_mapping(..);
>> else
>>      dma_fence_put(fence);
> Noted....
>
> Thanks,
> Arvind
>>
>> Regards,
>> Christian.
>>
>>> +
>>> +        amdgpu_gem_update_bo_mapping(filp, bo_va,
>>> +                         args->operation,
>>> +                         args->vm_timeline_point,
>>> +                         fence, timeline_syncobj,
>>> +                         timeline_chain);
>>>       }
>>>     error:




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux