Re: [PATCH] drm/ttm: Always take the bo delayed cleanup path for imported bos

On Mon, 2024-07-01 at 12:43 +0200, Christian König wrote:
> Am 28.06.24 um 20:13 schrieb Daniel Vetter:
> > On Fri, Jun 28, 2024 at 03:51:33PM +0000, Matthew Brost wrote:
> > > On Fri, Jun 28, 2024 at 05:38:48PM +0200, Thomas Hellström wrote:
> > > > Bos can be put with multiple unrelated dma-resv locks held. But
> > > > imported bos attempt to grab the bo dma-resv during dma-buf detach
> > > > that typically happens during cleanup. That leads to lockdep splats
> > > > similar to the one below and a potential ABBA deadlock.
> > > > 
> > > > Fix this by always taking the delayed workqueue cleanup path for
> > > > imported bos.
> > > > 
> > > > Requesting stable fixes from when the Xe driver was introduced,
> > > > since its usage of drm_exec and wide vm dma_resvs appears to be
> > > > the first reliable trigger of this.
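
To make the crossing concrete, here is a minimal standalone sketch of the two
acquisition orders, using plain pthread mutexes as stand-ins for the dma-resv
ww_mutexes; none of this is the real kernel code:

#include <pthread.h>

/* resv_a stands in for an unrelated dma-resv held via drm_exec during
 * cleanup; resv_b stands in for the imported bo's dma-resv that
 * dma_buf_detach() grabs. */
static pthread_mutex_t resv_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t resv_b = PTHREAD_MUTEX_INITIALIZER;

static void *cleanup_path(void *arg)
{
	pthread_mutex_lock(&resv_a);	/* held while unwinding drm_exec */
	pthread_mutex_lock(&resv_b);	/* taken again inside detach: A then B */
	pthread_mutex_unlock(&resv_b);
	pthread_mutex_unlock(&resv_a);
	return arg;
}

static void *other_path(void *arg)
{
	pthread_mutex_lock(&resv_b);	/* some other task: B first ... */
	pthread_mutex_lock(&resv_a);	/* ... then A, the classic ABBA order */
	pthread_mutex_unlock(&resv_a);
	pthread_mutex_unlock(&resv_b);
	return arg;
}

int main(void)
{
	pthread_t t1, t2;

	/* If the two threads interleave, each ends up waiting on the
	 * mutex the other one holds and the program hangs. */
	pthread_create(&t1, NULL, cleanup_path, NULL);
	pthread_create(&t2, NULL, other_path, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}
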
> > > > 
> > > > [22982.116427] ============================================
> > > > [22982.116428] WARNING: possible recursive locking detected
> > > > [22982.116429] 6.10.0-rc2+ #10 Tainted: G     U  W
> > > > [22982.116430] --------------------------------------------
> > > > [22982.116430] glxgears:sh0/5785 is trying to acquire lock:
> > > > [22982.116431] ffff8c2bafa539a8
> > > > (reservation_ww_class_mutex){+.+.}-{3:3}, at:
> > > > dma_buf_detach+0x3b/0xf0
> > > > [22982.116438]
> > > >                 but task is already holding lock:
> > > > [22982.116438] ffff8c2d9aba6da8
> > > > (reservation_ww_class_mutex){+.+.}-{3:3}, at:
> > > > drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
> > > > [22982.116442]
> > > >                 other info that might help us debug this:
> > > > [22982.116442]  Possible unsafe locking scenario:
> > > > 
> > > > [22982.116443]        CPU0
> > > > [22982.116444]        ----
> > > > [22982.116444]   lock(reservation_ww_class_mutex);
> > > > [22982.116445]   lock(reservation_ww_class_mutex);
> > > > [22982.116447]
> > > >                  *** DEADLOCK ***
> > > > 
> > > > [22982.116447]  May be due to missing lock nesting notation
> > > > 
> > > > [22982.116448] 5 locks held by glxgears:sh0/5785:
> > > > [22982.116449]  #0: ffff8c2d9aba58c8 (&xef->vm.lock){+.+.}-
> > > > {3:3}, at: xe_file_close+0xde/0x1c0 [xe]
> > > > [22982.116507]  #1: ffff8c2e28cc8480 (&vm->lock){++++}-{3:3},
> > > > at: xe_vm_close_and_put+0x161/0x9b0 [xe]
> > > > [22982.116578]  #2: ffff8c2e31982970 (&val->lock){.+.+}-{3:3},
> > > > at: xe_validation_ctx_init+0x6d/0x70 [xe]
> > > > [22982.116647]  #3: ffffacdc469478a8
> > > > (reservation_ww_class_acquire){+.+.}-{0:0}, at:
> > > > xe_vma_destroy_unlocked+0x7f/0xe0 [xe]
> > > > [22982.116716]  #4: ffff8c2d9aba6da8
> > > > (reservation_ww_class_mutex){+.+.}-{3:3}, at:
> > > > drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
> > > > [22982.116719]
> > > >                 stack backtrace:
> > > > [22982.116720] CPU: 8 PID: 5785 Comm: glxgears:sh0 Tainted:
> > > > G     U  W          6.10.0-rc2+ #10
> > > > [22982.116721] Hardware name: ASUS System Product Name/PRIME
> > > > B560M-A AC, BIOS 2001 02/01/2023
> > > > [22982.116723] Call Trace:
> > > > [22982.116724]  <TASK>
> > > > [22982.116725]  dump_stack_lvl+0x77/0xb0
> > > > [22982.116727]  __lock_acquire+0x1232/0x2160
> > > > [22982.116730]  lock_acquire+0xcb/0x2d0
> > > > [22982.116732]  ? dma_buf_detach+0x3b/0xf0
> > > > [22982.116734]  ? __lock_acquire+0x417/0x2160
> > > > [22982.116736]  __ww_mutex_lock.constprop.0+0xd0/0x13b0
> > > > [22982.116738]  ? dma_buf_detach+0x3b/0xf0
> > > > [22982.116741]  ? dma_buf_detach+0x3b/0xf0
> > > > [22982.116743]  ? ww_mutex_lock+0x2b/0x90
> > > > [22982.116745]  ww_mutex_lock+0x2b/0x90
> > > > [22982.116747]  dma_buf_detach+0x3b/0xf0
> > > > [22982.116749]  drm_prime_gem_destroy+0x2f/0x40 [drm]
> > > > [22982.116775]  xe_ttm_bo_destroy+0x32/0x220 [xe]
> > > > [22982.116818]  ? __mutex_unlock_slowpath+0x3a/0x290
> > > > [22982.116821]  drm_exec_unlock_all+0xa1/0xd0 [drm_exec]
> > > > [22982.116823]  drm_exec_fini+0x12/0xb0 [drm_exec]
> > > > [22982.116824]  xe_validation_ctx_fini+0x15/0x40 [xe]
> > > > [22982.116892]  xe_vma_destroy_unlocked+0xb1/0xe0 [xe]
> > > > [22982.116959]  xe_vm_close_and_put+0x41a/0x9b0 [xe]
> > > > [22982.117025]  ? xa_find+0xe3/0x1e0
> > > > [22982.117028]  xe_file_close+0x10a/0x1c0 [xe]
> > > > [22982.117074]  drm_file_free+0x22a/0x280 [drm]
> > > > [22982.117099]  drm_release_noglobal+0x22/0x70 [drm]
> > > > [22982.117119]  __fput+0xf1/0x2d0
> > > > [22982.117122]  task_work_run+0x59/0x90
> > > > [22982.117125]  do_exit+0x330/0xb40
> > > > [22982.117127]  do_group_exit+0x36/0xa0
> > > > [22982.117129]  get_signal+0xbd2/0xbe0
> > > > [22982.117131]  arch_do_signal_or_restart+0x3e/0x240
> > > > [22982.117134]  syscall_exit_to_user_mode+0x1e7/0x290
> > > > [22982.117137]  do_syscall_64+0xa1/0x180
> > > > [22982.117139]  ? lock_acquire+0xcb/0x2d0
> > > > [22982.117140]  ? __set_task_comm+0x28/0x1e0
> > > > [22982.117141]  ? find_held_lock+0x2b/0x80
> > > > [22982.117144]  ? __set_task_comm+0xe1/0x1e0
> > > > [22982.117145]  ? lock_release+0xca/0x290
> > > > [22982.117147]  ? __do_sys_prctl+0x245/0xab0
> > > > [22982.117149]  ? lockdep_hardirqs_on_prepare+0xde/0x190
> > > > [22982.117150]  ? syscall_exit_to_user_mode+0xb0/0x290
> > > > [22982.117152]  ? do_syscall_64+0xa1/0x180
> > > > [22982.117154]  ? __lock_acquire+0x417/0x2160
> > > > [22982.117155]  ? reacquire_held_locks+0xd1/0x1f0
> > > > [22982.117156]  ? do_user_addr_fault+0x30c/0x790
> > > > [22982.117158]  ? lock_acquire+0xcb/0x2d0
> > > > [22982.117160]  ? find_held_lock+0x2b/0x80
> > > > [22982.117162]  ? do_user_addr_fault+0x357/0x790
> > > > [22982.117163]  ? lock_release+0xca/0x290
> > > > [22982.117164]  ? do_user_addr_fault+0x361/0x790
> > > > [22982.117166]  ? trace_hardirqs_off+0x4b/0xc0
> > > > [22982.117168]  ? clear_bhb_loop+0x45/0xa0
> > > > [22982.117170]  ? clear_bhb_loop+0x45/0xa0
> > > > [22982.117172]  ? clear_bhb_loop+0x45/0xa0
> > > > [22982.117174]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > > [22982.117176] RIP: 0033:0x7f943d267169
> > > > [22982.117192] Code: Unable to access opcode bytes at
> > > > 0x7f943d26713f.
> > > > [22982.117193] RSP: 002b:00007f9430bffc80 EFLAGS: 00000246
> > > > ORIG_RAX: 00000000000000ca
> > > > [22982.117195] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX:
> > > > 00007f943d267169
> > > > [22982.117196] RDX: 0000000000000000 RSI: 0000000000000189 RDI:
> > > > 00005622f89579d0
> > > > [22982.117197] RBP: 00007f9430bffcb0 R08: 0000000000000000 R09:
> > > > 00000000ffffffff
> > > > [22982.117198] R10: 0000000000000000 R11: 0000000000000246 R12:
> > > > 0000000000000000
> > > > [22982.117199] R13: 0000000000000000 R14: 0000000000000000 R15:
> > > > 00005622f89579d0
> > > > [22982.117202]  </TASK>
> > > > 
> > > > Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
> > > > Cc: Christian König <christian.koenig@xxxxxxx>
> > > > Cc: Daniel Vetter <daniel@xxxxxxxx>
> > > > Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx
> > > > Cc: intel-xe@xxxxxxxxxxxxxxxxxxxxx
> > > > Cc: <stable@xxxxxxxxxxxxxxx> # v6.8+
> > > Patch and explanation make sense to me.
> > > 
> > > Reviewed-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > I guess for stable this is good since it's minimal, but after Thomas
> > explained what he meant with dma_buf_detach_unlocked I think that makes
> > total sense as a follow-up patch, maybe.
> > 
> > Reviewed-by: Daniel Vetter <daniel.vetter@xxxxxxxx>
> 
> IIRC the last time this issue came up, a possible solution discussed was
> to change dma_buf_detach() so that it is always called with the
> reservation lock held.
> 
> The problem at the time was that this meant changing
> dmabuf->ops->detach() to also be called with that lock held, and after
> that nobody looked into it any more.
> 
> Anyway, Reviewed-by: Christian König <christian.koenig@xxxxxxx> for this
> patch.

Thanks for the review. It looks like a dma-buf fix would involve a number
of drivers, so I pushed this to drm-misc-fixes for now. It looks like both
radeon and amdgpu may potentially suffer from this. Nouveau does detach
in gem_release, which looks incorrect if GPU access is still going on at
that point.

So this will probably need to be properly revisited.
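
For context, a purely hypothetical sketch of the convention discussed above,
where detach expects the caller to already hold the exporter's reservation
lock. The names (fake_dma_buf, fake_dma_buf_detach_locked) and the pthread
mutex are made up for illustration; this is not the current dma-buf API:

#include <pthread.h>

struct fake_dma_buf {
	pthread_mutex_t resv;	/* stand-in for the exporter's dma-resv */
};

/* Hypothetical rule: the caller already holds buf->resv, so the exporter's
 * ->detach() callback would also run with the lock held and no second lock
 * acquisition happens inside detach. */
static void fake_dma_buf_detach_locked(struct fake_dma_buf *buf)
{
	(void)buf;
}

static void bo_cleanup(struct fake_dma_buf *buf)
{
	pthread_mutex_lock(&buf->resv);		/* e.g. the trylock already done in ttm_bo_release() */
	fake_dma_buf_detach_locked(buf);	/* no nested resv locking, so no ABBA window */
	pthread_mutex_unlock(&buf->resv);
}

As noted above, a real version of this would touch every exporter's
->detach() implementation, which is why it has not been pursued so far.
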

/Thomas


> 
> Regards,
> Christian.
> 
> > 
> > Cheers, Sima
> > > > Signed-off-by: Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  drivers/gpu/drm/ttm/ttm_bo.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > index 6396dece0db1..2427be8bc97f 100644
> > > > --- a/drivers/gpu/drm/ttm/ttm_bo.c
> > > > +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> > > > @@ -346,6 +346,7 @@ static void ttm_bo_release(struct kref *kref)
> > > >  		if (!dma_resv_test_signaled(bo->base.resv,
> > > >  					    DMA_RESV_USAGE_BOOKKEEP) ||
> > > >  		    (want_init_on_free() && (bo->ttm != NULL)) ||
> > > > +		    bo->type == ttm_bo_type_sg ||
> > > >  		    !dma_resv_trylock(bo->base.resv)) {
> > > >  			/* The BO is not idle, resurrect it for delayed destroy */
> > > >  			ttm_bo_flush_all_fences(bo);
> > > > -- 
> > > > 2.44.0
> > > > 
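
To restate the effect of the one-line change above in standalone form
(simplified stand-in types and names, not the real TTM code): with the extra
ttm_bo_type_sg test, an imported BO always takes the delayed-destroy path, so
dma_buf_detach() later runs from the cleanup worker where no unrelated
dma-resv locks are held.

#include <stdbool.h>

enum fake_bo_type { fake_bo_type_device, fake_bo_type_kernel, fake_bo_type_sg };

struct fake_bo {
	enum fake_bo_type type;
	bool fences_signaled;	/* dma_resv_test_signaled(..., BOOKKEEP) */
	bool init_on_free;	/* want_init_on_free() && bo->ttm != NULL */
	bool resv_trylocked;	/* dma_resv_trylock(bo->base.resv) succeeded */
};

/* Mirrors the if-condition in ttm_bo_release(): true means the BO is
 * resurrected and freed later from the delayed-destroy worker. */
static bool needs_delayed_destroy(const struct fake_bo *bo)
{
	return !bo->fences_signaled ||
	       bo->init_on_free ||
	       bo->type == fake_bo_type_sg ||	/* new: imported (SG) BOs */
	       !bo->resv_trylocked;
}
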
> 




