Re: [PATCH] drm/amdgpu: Handle duplicate BOs during process restore

"Lazar, Lijo" <lijo.lazar@xxxxxxx> · Mon, 11 Mar 2024 13:42:53 +0530

On 3/8/2024 10:17 PM, Felix Kuehling wrote:
> On 2024-03-08 11:22, Mukul Joshi wrote:
>> In certain situations, some apps can import a BO multiple times
>> (through IPC for example). To restore such processes successfully,
>> we need to tell drm to ignore duplicate BOs.
>> While at it, also add additional logging to prevent silent failures
>> when process restore fails.
>>
>> Signed-off-by: Mukul Joshi <mukul.joshi@xxxxxxx>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 14 ++++++++++----
>>   1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index bf8e6653341f..65d808d8b5da 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -2869,14 +2869,16 @@ int
>> amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence
>> __rcu *
>>         mutex_lock(&process_info->lock);
>>   -    drm_exec_init(&exec, 0);
>> +    drm_exec_init(&exec, DRM_EXEC_IGNORE_DUPLICATES);
>>       drm_exec_until_all_locked(&exec) {
>>           list_for_each_entry(peer_vm, &process_info->vm_list_head,
>>                       vm_list_node) {
>>               ret = amdgpu_vm_lock_pd(peer_vm, &exec, 2);
>>               drm_exec_retry_on_contention(&exec);
>> -            if (unlikely(ret))
>> +            if (unlikely(ret)) {
>> +                pr_err("Locking VM PD failed, ret: %d\n", ret);
> 
> pr_err makes sense here as it indicates a persistent problem that would
> cause soft hangs, like in this case.
> 
> 
>>                   goto ttm_reserve_fail;
>> +            }
>>           }
>>             /* Reserve all BOs and page tables/directory. Add all BOs
>> from
>> @@ -2889,8 +2891,10 @@ int
>> amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence
>> __rcu *
>>               gobj = &mem->bo->tbo.base;
>>               ret = drm_exec_prepare_obj(&exec, gobj, 1);
>>               drm_exec_retry_on_contention(&exec);
>> -            if (unlikely(ret))
>> +            if (unlikely(ret)) {
>> +                pr_err("drm_exec_prepare_obj failed, ret: %d\n", ret);
> 
> Same here, pr_err is fine.
> 

These kinds of prints - "<func name> failed <error code>" - are way too
generic, and if more like this are added, it will be difficult to find
out where they are even coming from.

It's always better to include context so that the message translates to
useful information in dmesg. At minimum, that means the device or BO
details or something of that sort.
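
To illustrate the kind of message meant here, a rough userspace sketch
follows. It only shows the message shape; struct demo_bo and
demo_format_restore_err() are hypothetical stand-ins, not amdgpu code.
In the driver, a device-aware print such as dev_err() would presumably
carry the device part.

```c
#include <stdio.h>

/* Hypothetical stand-in for illustration only; not an amdgpu structure. */
struct demo_bo {
	const char *dev_name;		/* e.g. PCI address of the owning device */
	unsigned long long size;	/* BO size in bytes */
};

/*
 * Format a restore-failure message that carries context (device, calling
 * function, BO size) instead of a bare "<func> failed, ret: %d".
 * Returns what snprintf() returns: the length the full message needs.
 */
static int demo_format_restore_err(char *buf, size_t len, const char *func,
				   const struct demo_bo *bo, int ret)
{
	return snprintf(buf, len,
			"%s: %s: failed to reserve BO (size %llu) during restore, ret: %d",
			bo->dev_name, func, bo->size, ret);
}
```

With the device and BO details in the string, a line in dmesg can be
traced back to a specific GPU and buffer rather than only to a function
name.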

Thanks,
Lijo

> 
>>                   goto ttm_reserve_fail;
>> +            }
>>           }
>>       }
>>   @@ -2950,8 +2954,10 @@ int
>> amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence
>> __rcu *
>>        * validations above would invalidate DMABuf imports again.
>>        */
>>       ret = process_validate_vms(process_info, &exec.ticket);
>> -    if (ret)
>> +    if (ret) {
>> +        pr_err("Validating VMs failed, ret: %d\n", ret);
> 
> I'd make this a pr_debug to avoid spamming the log. Validation can fail
> intermittently, and rescheduling the worker is there to handle it.
> 
> With that fixed, the patch is
> 
> Reviewed-by: Felix Kuehling <felix.kuehling@xxxxxxx>
> 
> 
>>           goto validate_map_fail;
>> +    }
>>         /* Update mappings not managed by KFD */
>>       list_for_each_entry(peer_vm, &process_info->vm_list_head,