Am 27.02.25 um 16:08 schrieb Yadav, Arvind: > > On 2/27/2025 7:55 PM, Christian König wrote: >> >> Am 18.02.25 um 15:53 schrieb Arvind Yadav: >>> Encountering a taint issue during the unloading of gpu_sched >>> due to the fence not being released/put. In this context, >>> amdgpu_vm_clear_freed is responsible for creating a job to >>> update the page table (PT). It allocates kmem_cache for >>> drm_sched_fence and returns the finished fence associated >>> with job->base.s_fence. In case of Usermode queue this finished >>> fence is added to the timeline sync object through >>> amdgpu_gem_update_bo_mapping, which is utilized by user >>> space to ensure the completion of the PT update. >>> >>> [ 508.900587] ============================================================================= >>> [ 508.900605] BUG drm_sched_fence (Tainted: G N): Objects remaining in drm_sched_fence on __kmem_cache_shutdown() >>> [ 508.900617] ----------------------------------------------------------------------------- >>> >>> [ 508.900627] Slab 0xffffe0cc04548780 objects=32 used=2 fp=0xffff8ea81521f000 flags=0x17ffffc0000240(workingset|head|node=0|zone=2|lastcpupid=0x1fffff) >>> [ 508.900645] CPU: 3 UID: 0 PID: 2337 Comm: rmmod Tainted: G N 6.12.0+ #1 >>> [ 508.900651] Tainted: [N]=TEST >>> [ 508.900653] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS ELITE/X570 AORUS ELITE, BIOS F34 06/10/2021 >>> [ 508.900656] Call Trace: >>> [ 508.900659] <TASK> >>> [ 508.900665] dump_stack_lvl+0x70/0x90 >>> [ 508.900674] dump_stack+0x14/0x20 >>> [ 508.900678] slab_err+0xcb/0x110 >>> [ 508.900687] ? srso_return_thunk+0x5/0x5f >>> [ 508.900692] ? try_to_grab_pending+0xd3/0x1d0 >>> [ 508.900697] ? srso_return_thunk+0x5/0x5f >>> [ 508.900701] ? mutex_lock+0x17/0x50 >>> [ 508.900708] __kmem_cache_shutdown+0x144/0x2d0 >>> [ 508.900713] ? flush_rcu_work+0x50/0x60 >>> [ 508.900719] kmem_cache_destroy+0x46/0x1f0 >>> [ 508.900728] drm_sched_fence_slab_fini+0x19/0x970 [gpu_sched] >>> [ 508.900736] __do_sys_delete_module.constprop.0+0x184/0x320 >>> [ 508.900744] ? srso_return_thunk+0x5/0x5f >>> [ 508.900747] ? debug_smp_processor_id+0x1b/0x30 >>> [ 508.900754] __x64_sys_delete_module+0x16/0x20 >>> [ 508.900758] x64_sys_call+0xdf/0x20d0 >>> [ 508.900763] do_syscall_64+0x51/0x120 >>> [ 508.900769] entry_SYSCALL_64_after_hwframe+0x76/0x7e >>> >>> v2: call dma_fence_put in amdgpu_gem_va_update_vm >>> v3: Addressed review comments from Christian. >>> - calling amdgpu_gem_update_timeline_node before switch. >>> - puting a dma_fence in case of error or !timeline_syncobj. >>> >>> Cc: Alex Deucher <alexander.deucher@xxxxxxx> >>> Cc: Christian König <christian.koenig@xxxxxxx> >>> Cc: Shashank Sharma <shashank.sharma@xxxxxxx> >>> Cc: Sunil Khatri <sunil.khatri@xxxxxxx> >>> Signed-off-by: Le Ma <le.ma@xxxxxxx> >>> Signed-off-by: Arvind Yadav <arvind.yadav@xxxxxxx> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c | 33 +++++++++++++------------ >>> 1 file changed, 17 insertions(+), 16 deletions(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c >>> index 8b67aae6c2fe..40522b4f331b 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c >>> @@ -125,9 +125,6 @@ amdgpu_gem_update_bo_mapping(struct drm_file *filp, >>> struct amdgpu_vm *vm = &fpriv->vm; >>> struct dma_fence *last_update; >>> - if (!syncobj) >>> - return; >>> - >>> /* Find the last update fence */ >>> switch (operation) { >>> case AMDGPU_VA_OP_MAP: >>> @@ -839,6 +836,7 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data, >>> struct drm_exec exec; >>> uint64_t va_flags; >>> uint64_t vm_size; >>> + int syncobj_status; >>> int r = 0; >>> if (args->va_address < AMDGPU_VA_RESERVED_BOTTOM) { >>> @@ -931,6 +929,12 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data, >>> bo_va = NULL; >>> } >>> + syncobj_status = amdgpu_gem_update_timeline_node(filp, >>> + args->vm_timeline_syncobj_out, >>> + args->vm_timeline_point, >>> + &timeline_syncobj, >>> + &timeline_chain); >>> + >> You don't need syncobj_status here, just assign the return value to r and abort if we can't find any syncobj. > > I have not used variable 'r' because below inside switch(args->operation) the 'r' value will be reinitialized and the return value of amdgpu_gem_update_timeline_node will not be used. Here, we cannot abort because syncobj will be NULL for Non-UQ. No, no that's wrong. That timeline_syncobj is NULL is not an error. In other words when args->vm_timeline_syncobj_out == 0 then amdgpu_gem_update_timeline_node() should just set timeline_syncobj=NULL and return 0. The error happens only if either args->vm_timeline_syncobj_out has a handler we can't find or if we fail to allocate memory for the timeline_chain. In this case the return value should be EINVAl or ENOMEM and then we absolutely should abort the operation. Regards, Christian. > we can use r but it will not do anything. :) > > >> >>> switch (args->operation) { >>> case AMDGPU_VA_OP_MAP: >>> va_flags = amdgpu_gem_va_map_flags(adev, args->flags); >>> @@ -957,22 +961,19 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data, >>> break; >>> } >>> if (!r && !(args->flags & AMDGPU_VM_DELAY_UPDATE) && !adev->debug_vm) { >>> - >>> - r = amdgpu_gem_update_timeline_node(filp, >>> - args->vm_timeline_syncobj_out, >>> - args->vm_timeline_point, >>> - &timeline_syncobj, >>> - &timeline_chain); >>> - >>> fence = amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va, >>> args->operation); >>> - if (!r) >>> - amdgpu_gem_update_bo_mapping(filp, bo_va, >>> - args->operation, >>> - args->vm_timeline_point, >>> - fence, timeline_syncobj, >>> - timeline_chain); >>> + if (syncobj_status || !timeline_syncobj) { >>> + dma_fence_put(fence); >>> + goto error; >>> + } >> That should probably be something like this: >> >> if (timeline_syncobj) >> amdgpu_gem_update_bo_mapping(..); >> else >> dma_fence_put(fence); > Noted.... > > Thanks, > Arvind >> >> Regards, >> Christian. >> >>> + >>> + amdgpu_gem_update_bo_mapping(filp, bo_va, >>> + args->operation, >>> + args->vm_timeline_point, >>> + fence, timeline_syncobj, >>> + timeline_chain); >>> } >>> error: