RE: [PATCH] drm/amdgpu: Handle duplicate BOs during process restore

"Joshi, Mukul" <Mukul.Joshi@xxxxxxx> · Mon, 11 Mar 2024 15:26:03 +0000

> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@xxxxxxx>
> Sent: Monday, March 11, 2024 4:13 AM
> To: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Joshi, Mukul
> <Mukul.Joshi@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] drm/amdgpu: Handle duplicate BOs during process
> restore
>
>
>
> On 3/8/2024 10:17 PM, Felix Kuehling wrote:
> > On 2024-03-08 11:22, Mukul Joshi wrote:
> >> In certain situations, some apps can import a BO multiple times
> >> (through IPC for example). To restore such processes successfully, we
> >> need to tell drm to ignore duplicate BOs.
> >> While at it, also add additional logging to prevent silent failures
> >> when process restore fails.
> >>
> >> Signed-off-by: Mukul Joshi <mukul.joshi@xxxxxxx>
> >> ---
> >>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 14 ++++++++++----
> >>   1 file changed, 10 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >> index bf8e6653341f..65d808d8b5da 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> >> @@ -2869,14 +2869,16 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
> >>         mutex_lock(&process_info->lock);
> >>   -    drm_exec_init(&exec, 0);
> >> +    drm_exec_init(&exec, DRM_EXEC_IGNORE_DUPLICATES);
> >>       drm_exec_until_all_locked(&exec) {
> >>           list_for_each_entry(peer_vm, &process_info->vm_list_head,
> >>                       vm_list_node) {
> >>               ret = amdgpu_vm_lock_pd(peer_vm, &exec, 2);
> >>               drm_exec_retry_on_contention(&exec);
> >> -            if (unlikely(ret))
> >> +            if (unlikely(ret)) {
> >> +                pr_err("Locking VM PD failed, ret: %d\n", ret);
> >
> > pr_err makes sense here as it indicates a persistent problem that
> > would cause soft hangs, like in this case.
> >
> >
> >>                   goto ttm_reserve_fail;
> >> +            }
> >>           }
> >>             /* Reserve all BOs and page tables/directory. Add all BOs from
> >> @@ -2889,8 +2891,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
> >>               gobj = &mem->bo->tbo.base;
> >>               ret = drm_exec_prepare_obj(&exec, gobj, 1);
> >>               drm_exec_retry_on_contention(&exec);
> >> -            if (unlikely(ret))
> >> +            if (unlikely(ret)) {
> >> +                pr_err("drm_exec_prepare_obj failed, ret: %d\n", ret);
> >
> > Same here, pr_err is fine.
> >
>
> These kinds of prints - "<func name> failed <error code>" - are way too generic,
> and if more like this are added, it will be difficult to even find out where
> they are coming from.

Will send a follow-up patch to put a more meaningful message here.

Thanks,
Mukul
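
For illustration only, a message with more context might name the failing
step and the process being restored. The sketch below is plain userspace C,
not kernel code: format_restore_err(), the message layout, and the pid
parameter are hypothetical and do not come from this patch or from the
follow-up. In the driver, the resulting string would go through pr_err() or
dev_err() rather than into a buffer.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of amdgpu: builds a log line that names
 * the failing step and the process being restored, so the message can be
 * traced back to its origin in dmesg.
 *
 * Returns the number of characters snprintf() would have written.
 */
static int format_restore_err(char *buf, size_t len, const char *step,
                              int pid, int ret)
{
    return snprintf(buf, len, "restore pid %d: %s failed, ret: %d\n",
                    pid, step, ret);
}
```

Calling format_restore_err(buf, sizeof(buf), "reserving BO", 1234, -22)
fills buf with "restore pid 1234: reserving BO failed, ret: -22\n".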

>
> It's always better to include context so that the message translates to useful
> information in dmesg. At minimum, that context is the device or BO details, or
> anything of that sort.
>
> Thanks,
> Lijo
>
> >
> >>                   goto ttm_reserve_fail;
> >> +            }
> >>           }
> >>       }
> >> @@ -2950,8 +2954,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
> >>        * validations above would invalidate DMABuf imports again.
> >>        */
> >>       ret = process_validate_vms(process_info, &exec.ticket);
> >> -    if (ret)
> >> +    if (ret) {
> >> +        pr_err("Validating VMs failed, ret: %d\n", ret);
> >
> > I'd make this a pr_debug to avoid spamming the log. Validation can
> > fail intermittently, and rescheduling the worker is there to handle it.
> >
> > With that fixed, the patch is
> >
> > Reviewed-by: Felix Kuehling <felix.kuehling@xxxxxxx>
> >
> >
> >>           goto validate_map_fail;
> >> +    }
> >>         /* Update mappings not managed by KFD */
> >>       list_for_each_entry(peer_vm, &process_info->vm_list_head,