RE: [PATCH] drm/amdkfd: reserve the BO before validating it

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[Public]

>-----Original Message-----
>From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
>Sent: Friday, January 12, 2024 12:19 AM
>To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Francis, David
><David.Francis@xxxxxxx>
>Cc: Yang, Philip <Philip.Yang@xxxxxxx>
>Subject: Re: [PATCH] drm/amdkfd: reserve the BO before validating it
>
>On 2024-01-11 02:22, Lang Yu wrote:
>> Fixes: 410f08516e0f ("drm/amdkfd: Move dma unmapping after TLB flush")
>>
>> [   41.708711] WARNING: CPU: 0 PID: 1463 at
>drivers/gpu/drm/ttm/ttm_bo.c:846 ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.708989] Call Trace:
>> [   41.708992]  <TASK>
>> [   41.708996]  ? show_regs+0x6c/0x80
>> [   41.709000]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.709008]  ? __warn+0x93/0x190
>> [   41.709014]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.709024]  ? report_bug+0x1f9/0x210
>> [   41.709035]  ? handle_bug+0x46/0x80
>> [   41.709041]  ? exc_invalid_op+0x1d/0x80
>> [   41.709048]  ? asm_exc_invalid_op+0x1f/0x30
>> [   41.709057]  ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80
>[amdgpu]
>> [   41.709185]  ? ttm_bo_validate+0x146/0x1b0 [ttm]
>> [   41.709197]  ? amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x2c/0x80
>[amdgpu]
>> [   41.709337]  ? srso_alias_return_thunk+0x5/0x7f
>> [   41.709346]  kfd_mem_dmaunmap_attachment+0x9e/0x1e0 [amdgpu]
>> [   41.709467]  amdgpu_amdkfd_gpuvm_dmaunmap_mem+0x56/0x80
>[amdgpu]
>> [   41.709586]  kfd_ioctl_unmap_memory_from_gpu+0x1b7/0x300 [amdgpu]
>> [   41.709710]  kfd_ioctl+0x1ec/0x650 [amdgpu]
>> [   41.709822]  ? __pfx_kfd_ioctl_unmap_memory_from_gpu+0x10/0x10
>[amdgpu]
>> [   41.709945]  ? srso_alias_return_thunk+0x5/0x7f
>> [   41.709949]  ? tomoyo_file_ioctl+0x20/0x30
>> [   41.709959]  __x64_sys_ioctl+0x9c/0xd0
>> [   41.709967]  do_syscall_64+0x3f/0x90
>> [   41.709973]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>>
>> Signed-off-by: Lang Yu <Lang.Yu@xxxxxxx>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 48697b789342..f5542a4ab8ed 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -2095,8 +2095,13 @@ void
>amdgpu_amdkfd_gpuvm_dmaunmap_mem(struct kgd_mem *mem, void
>*drm_priv)
>>      mutex_lock(&mem->lock);
>>
>>      list_for_each_entry(entry, &mem->attachments, list) {
>> -            if (entry->bo_va->base.vm == vm)
>> +            if (entry->bo_va->base.vm != vm)
>> +                    continue;
>> +
>> +            if (!WARN_ON(amdgpu_bo_reserve(entry->bo_va->base.bo,
>true))) {
>>                      kfd_mem_dmaunmap_attachment(mem, entry);
>> +                    amdgpu_bo_unreserve(entry->bo_va->base.bo);
>> +            }
>
>I'm pretty sure someone else worked on a fix for this before. This is not a good
>solution. We need to handle failed reservations (due to
>ERESTARTSYS) and make sure that the unmap ioctl can be restarted correctly in
>that case.
>
>See
>https://lore.kernel.org/amd-gfx/530aac57-5561-4d1d-879a-
>93b108e5c8c2@xxxxxxxxx/

Got it. Thanks.

Christian's concern is
"Kernel operations should either completely fail, fully complete or
explicitly return how much they completed (e.g. how many bytes
transferred for example). That we only partially complete and track that
state inside the kernel is usually a no-go."

ERESTART_RESTARTBLOCK is for partially restart and may help.

Regards,
Lang

>David, do you have any update on this work?
>
>Regards,
>   Felix
>
>
>>      }
>>
>>      mutex_unlock(&mem->lock);




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux