[Public] >-----Original Message----- >From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> >Sent: Friday, January 12, 2024 12:19 AM >To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Francis, David ><David.Francis@xxxxxxx> >Cc: Yang, Philip <Philip.Yang@xxxxxxx> >Subject: Re: [PATCH] drm/amdkfd: reserve the BO before validating it > >On 2024-01-11 02:22, Lang Yu wrote: >> Fixes: 410f08516e0f ("drm/amdkfd: Move dma unmapping after TLB flush") >> >> [ 41.708711] WARNING: CPU: 0 PID: 1463 at >drivers/gpu/drm/ttm/ttm_bo.c:846 ttm_bo_validate+0x146/0x1b0 [ttm] >> [ 41.708989] Call Trace: >> [ 41.708992] <TASK> >> [ 41.708996] ? show_regs+0x6c/0x80 >> [ 41.709000] ? ttm_bo_validate+0x146/0x1b0 [ttm] >> [ 41.709008] ? __warn+0x93/0x190 >> [ 41.709014] ? ttm_bo_validate+0x146/0x1b0 [ttm] >> [ 41.709024] ? report_bug+0x1f9/0x210 >> [ 41.709035] ? handle_bug+0x46/0x80 >> [ 41.709041] ? exc_invalid_op+0x1d/0x80 >> [ 41.709048] ? asm_exc_invalid_op+0x1f/0x30 >> [ 41.709057] ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80 >[amdgpu] >> [ 41.709185] ? ttm_bo_validate+0x146/0x1b0 [ttm] >> [ 41.709197] ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80 >[amdgpu] >> [ 41.709337] ? srso_alias_return_thunk+0x5/0x7f >> [ 41.709346] kfd_mem_dmaunmap_attachment+0x9e/0x1e0 [amdgpu] >> [ 41.709467] amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x56/0x80 >[amdgpu] >> [ 41.709586] kfd_ioctl_unmap_memory_from_gpu+0x1b7/0x300 [amdgpu] >> [ 41.709710] kfd_ioctl+0x1ec/0x650 [amdgpu] >> [ 41.709822] ? __pfx_kfd_ioctl_unmap_memory_from_gpu+0x10/0x10 >[amdgpu] >> [ 41.709945] ? srso_alias_return_thunk+0x5/0x7f >> [ 41.709949] ? tomoyo_file_ioctl+0x20/0x30 >> [ 41.709959] __x64_sys_ioctl+0x9c/0xd0 >> [ 41.709967] do_syscall_64+0x3f/0x90 >> [ 41.709973] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 >> >> Signed-off-by: Lang Yu <Lang.Yu@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> index 48697b789342..f5542a4ab8ed 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> @@ -2095,8 +2095,13 @@ void >amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void >*drm_priv) >> mutex_lock(&mem->lock); >> >> list_for_each_entry(entry, &mem->attachments, list) { >> - if (entry->bo_va->base.vm == vm) >> + if (entry->bo_va->base.vm != vm) >> + continue; >> + >> + if (!WARN_ON(amdgpu_bo_reserve(entry->bo_va->base.bo, >true))) { >> kfd_mem_dmaunmap_attachment(mem, entry); >> + amdgpu_bo_unreserve(entry->bo_va->base.bo); >> + } > >I'm pretty sure someone else worked on a fix for this before. This is not a good >solution. We need to handle failed reservations (due to >ERESTARTSYS) and make sure that the unmap ioctl can be restarted correctly in >that case. > >See >https://lore.kernel.org/amd-gfx/530aac57-5561-4d1d-879a- >93b108e5c8c2@xxxxxxxxx/ Got it. Thanks. Christian's concern is "Kernel operations should either completely fail, fully complete or explicitly return how much they completed (e.g. how many bytes transferred for example). That we only partially complete and track that state inside the kernel is usually a no-go." ERESTART_RESTARTBLOCK is for partially restart and may help. Regards, Lang >David, do you have any update on this work? > >Regards, > Felix > > >> } >> >> mutex_unlock(&mem->lock);