On 3/8/2024 10:17 PM, Felix Kuehling wrote:
> On 2024-03-08 11:22, Mukul Joshi wrote:
>> In certain situations, some apps can import a BO multiple times
>> (through IPC for example). To restore such processes successfully,
>> we need to tell drm to ignore duplicate BOs.
>> While at it, also add additional logging to prevent silent failures
>> when process restore fails.
>>
>> Signed-off-by: Mukul Joshi <mukul.joshi@xxxxxxx>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index bf8e6653341f..65d808d8b5da 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -2869,14 +2869,16 @@ int
>> amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence
>> __rcu *
>>      mutex_lock(&process_info->lock);
>> -    drm_exec_init(&exec, 0);
>> +    drm_exec_init(&exec, DRM_EXEC_IGNORE_DUPLICATES);
>>      drm_exec_until_all_locked(&exec) {
>>          list_for_each_entry(peer_vm, &process_info->vm_list_head,
>>                      vm_list_node) {
>>              ret = amdgpu_vm_lock_pd(peer_vm, &exec, 2);
>>              drm_exec_retry_on_contention(&exec);
>> -            if (unlikely(ret))
>> +            if (unlikely(ret)) {
>> +                pr_err("Locking VM PD failed, ret: %d\n", ret);
>
> pr_err makes sense here as it indicates a persistent problem that would
> cause soft hangs, like in this case.
>
>
>>                  goto ttm_reserve_fail;
>> +            }
>>          }
>>          /* Reserve all BOs and page tables/directory. Add all BOs
>> from
>> @@ -2889,8 +2891,10 @@ int
>> amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence
>> __rcu *
>>                  gobj = &mem->bo->tbo.base;
>>                  ret = drm_exec_prepare_obj(&exec, gobj, 1);
>>                  drm_exec_retry_on_contention(&exec);
>> -                if (unlikely(ret))
>> +                if (unlikely(ret)) {
>> +                    pr_err("drm_exec_prepare_obj failed, ret: %d\n", ret);
>
> Same here, pr_err is fine.
>

These kinds of prints - "<func name> failed <error code>" - are way too
generic, and if more like this are added, it will be difficult to find
out even where they are coming from. It's always better to include some
context so that the message translates to useful information in dmesg.
The minimum context is the device or BO details, or something of that
sort.

Thanks,
Lijo

>
>>                      goto ttm_reserve_fail;
>> +                }
>>              }
>>          }
>> @@ -2950,8 +2954,10 @@ int
>> amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence
>> __rcu *
>>       * validations above would invalidate DMABuf imports again.
>>       */
>>      ret = process_validate_vms(process_info, &exec.ticket);
>> -    if (ret)
>> +    if (ret) {
>> +        pr_err("Validating VMs failed, ret: %d\n", ret);
>
> I'd make this a pr_debug to avoid spamming the log. validation can fail
> intermittently and rescheduling the worker is there to handle it.
>
> With that fixed, the patch is
>
> Reviewed-by: Felix Kuehling <felix.kuehling@xxxxxxx>
>
>
>>          goto validate_map_fail;
>> +    }
>>      /* Update mappings not managed by KFD */
>>      list_for_each_entry(peer_vm, &process_info->vm_list_head,