RE: [PATCH] drm/amdgpu: Handle duplicate BOs during process restore

"Joshi, Mukul" <Mukul.Joshi@xxxxxxx> · Fri, 8 Mar 2024 17:35:00 +0000

> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> Sent: Friday, March 8, 2024 11:48 AM
> To: Joshi, Mukul <Mukul.Joshi@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] drm/amdgpu: Handle duplicate BOs during process
> restore
>
> On 2024-03-08 11:22, Mukul Joshi wrote:
> > In certain situations, some apps can import a BO multiple times
> > (through IPC for example). To restore such processes successfully, we
> > need to tell drm to ignore duplicate BOs.
> > While at it, also add additional logging to prevent silent failures
> > when process restore fails.
> >
> > Signed-off-by: Mukul Joshi <mukul.joshi@xxxxxxx>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 14 ++++++++++----
> >   1 file changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> > index bf8e6653341f..65d808d8b5da 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> > @@ -2869,14 +2869,16 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
> >
> >     mutex_lock(&process_info->lock);
> >
> > -   drm_exec_init(&exec, 0);
> > +   drm_exec_init(&exec, DRM_EXEC_IGNORE_DUPLICATES);
> >     drm_exec_until_all_locked(&exec) {
> >             list_for_each_entry(peer_vm, &process_info->vm_list_head,
> >                                 vm_list_node) {
> >                     ret = amdgpu_vm_lock_pd(peer_vm, &exec, 2);
> >                     drm_exec_retry_on_contention(&exec);
> > -                   if (unlikely(ret))
> > +                   if (unlikely(ret)) {
> > +                           pr_err("Locking VM PD failed, ret: %d\n", ret);
>
> pr_err makes sense here as it indicates a persistent problem that would cause
> soft hangs, like in this case.
>
>
> >                             goto ttm_reserve_fail;
> > +                   }
> >             }
> >
> >             /* Reserve all BOs and page tables/directory. Add all BOs from
> > @@ -2889,8 +2891,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
> >                     gobj = &mem->bo->tbo.base;
> >                     ret = drm_exec_prepare_obj(&exec, gobj, 1);
> >                     drm_exec_retry_on_contention(&exec);
> > -                   if (unlikely(ret))
> > +                   if (unlikely(ret)) {
> > +                           pr_err("drm_exec_prepare_obj failed, ret: %d\n", ret);
>
> Same here, pr_err is fine.
>
>
> >                             goto ttm_reserve_fail;
> > +                   }
> >             }
> >     }
> >
> > @@ -2950,8 +2954,10 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence __rcu *
> >      * validations above would invalidate DMABuf imports again.
> >      */
> >     ret = process_validate_vms(process_info, &exec.ticket);
> > -   if (ret)
> > +   if (ret) {
> > +           pr_err("Validating VMs failed, ret: %d\n", ret);
>
> I'd make this a pr_debug to avoid spamming the log. Validation can fail
> intermittently, and rescheduling the worker is there to handle it.

Will update this to pr_debug before submitting the patch. Thank you.

Regards,
Mukul
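
For anyone following the thread, here is a rough userspace model of why the
flag matters when the same BO appears more than once in the list. This is only
a sketch under assumptions: the model_* names and MODEL_IGNORE_DUPLICATES are
hypothetical stand-ins, not the kernel's drm_exec API, and real locking is
reduced to remembering which pointers were already taken. It mirrors the
behaviour the patch relies on, where a repeated lock request yields -EALREADY
unless duplicates are ignored.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model; NOT the kernel's drm_exec implementation. */
#define MODEL_IGNORE_DUPLICATES (1u << 1)
#define MODEL_MAX_OBJS 16

struct model_exec {
	unsigned int flags;                 /* behaviour flags set at init */
	size_t count;                       /* number of distinct objects held */
	const void *locked[MODEL_MAX_OBJS]; /* objects already "locked" */
};

static void model_exec_init(struct model_exec *exec, unsigned int flags)
{
	exec->flags = flags;
	exec->count = 0;
}

/* "Lock" one object. A second request for the same object is a duplicate:
 * it fails with -EALREADY unless MODEL_IGNORE_DUPLICATES is set. */
static int model_prepare_obj(struct model_exec *exec, const void *obj)
{
	for (size_t i = 0; i < exec->count; i++) {
		if (exec->locked[i] == obj)
			return (exec->flags & MODEL_IGNORE_DUPLICATES) ? 0 : -EALREADY;
	}
	assert(exec->count < MODEL_MAX_OBJS);
	exec->locked[exec->count++] = obj;
	return 0;
}

/* Walk a BO list and stop at the first error, as the restore loop does. */
static int model_reserve_all(struct model_exec *exec,
			     const void *const *bos, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		int ret = model_prepare_obj(exec, bos[i]);

		if (ret)
			return ret;
	}
	return 0;
}
```

With a list that names one BO twice (the multiple-import case), initialising
the model with flags 0 makes model_reserve_all() stop with -EALREADY, while
MODEL_IGNORE_DUPLICATES lets the walk finish with each object held once.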
>
> With that fixed, the patch is
>
> Reviewed-by: Felix Kuehling <felix.kuehling@xxxxxxx>
>
>
> >             goto validate_map_fail;
> > +   }
> >
> >     /* Update mappings not managed by KFD */
> >     list_for_each_entry(peer_vm, &process_info->vm_list_head,