RE: [PATCH v2] drm/amdgpu: fix the memleak caused by fence not released

"YuanShang Mao (River)" <YuanShang.Mao@xxxxxxx> · Fri, 14 Feb 2025 12:14:23 +0000

[AMD Official Use Only - AMD Internal Distribution Only]

Better to put the fence outside amdgpu_gem_va_update_vm. Since it is passed to the caller, and the caller must keep one reference at least until this fence is no longer needed.

Thanks
River

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of Yadav, Arvind
Sent: Friday, February 14, 2025 7:42 PM
To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Ma, Le <Le.Ma@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Yadav, Arvind <Arvind.Yadav@xxxxxxx>
Cc: Zhang, Hawking <Hawking.Zhang@xxxxxxx>; Lazar, Lijo <Lijo.Lazar@xxxxxxx>
Subject: Re: [PATCH v2] drm/amdgpu: fix the memleak caused by fence not released

On 2/14/2025 4:08 PM, Christian König wrote:
> Adding Arvind, please make sure to keep him in the loop.
>
> Am 14.02.25 um 11:07 schrieb Le Ma:
>> On systems with CONFIG_SLUB_DEBUG enabled, the memleak like below
>> will show up explicitly during driver unloading if created bo without
>> drm_timeline object before.
>>
>>      BUG drm_sched_fence (Tainted: G           OE     ): Objects remaining in drm_sched_fence on __kmem_cache_shutdown()
>>      -----------------------------------------------------------------------------
>>      Call Trace:
>>      <TASK>
>>      dump_stack_lvl+0x4c/0x70
>>      dump_stack+0x14/0x20
>>      slab_err+0xb0/0xf0
>>      ? srso_alias_return_thunk+0x5/0xfbef5
>>      ? flush_work+0x12/0x20
>>      ? srso_alias_return_thunk+0x5/0xfbef5
>>      __kmem_cache_shutdown+0x163/0x2e0
>>      kmem_cache_destroy+0x61/0x170
>>      drm_sched_fence_slab_fini+0x19/0x900
>>
>> Thus call dma_fence_put properly to avoid the memleak.
>>
>> v2: call dma_fence_put in amdgpu_gem_va_update_vm
>>
>> Signed-off-by: Le Ma <le.ma@xxxxxxx>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 9 +++++++--
>>   1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> index 8b67aae6c2fe..00f1f34705c0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> @@ -759,7 +759,8 @@ static struct dma_fence *
>>   amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
>>                      struct amdgpu_vm *vm,
>>                      struct amdgpu_bo_va *bo_va,
>> -                    uint32_t operation)
>> +                    uint32_t operation,
>> +                    uint32_t syncobj_handle)
>>   {
>>      struct dma_fence *fence = dma_fence_get_stub();
>>      int r;
>> @@ -771,6 +772,9 @@ amdgpu_gem_va_update_vm(struct amdgpu_device *adev,
>>      if (r)
>>              goto error;
>>
>> +    if (!syncobj_handle)
>> +            dma_fence_put(fence);
>> +
> Having that check inside amdgpu_gem_update_bo_mapping() was actually correct. Here it doesn't make much sense.

Agreed,

Regards,
~Arvind

>
>>      if (operation == AMDGPU_VA_OP_MAP ||
>>          operation == AMDGPU_VA_OP_REPLACE) {
>>              r = amdgpu_vm_bo_update(adev, bo_va, false); @@ -965,7 +969,8 @@
>> int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>                                                  &timeline_chain);
> Right before this here is a call to amdgpu_gem_update_timeline_node() which is incorrectly placed.
>
> That needs to come much earlier, above the switch (args->operation)....
>
> Regards,
> Christian.
>
>>
>>              fence = amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>> -                                            args->operation);
>> +                                            args->operation,
>> +                                            args->vm_timeline_syncobj_out);
>>
>>              if (!r)
>>                      amdgpu_gem_update_bo_mapping(filp, bo_va,