[AMD Official Use Only] >-----Original Message----- >From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> >Sent: Friday, September 24, 2021 2:37 PM >To: Yu, Lang <Lang.Yu@xxxxxxx>; Koenig, Christian ><Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Huang, Ray ><Ray.Huang@xxxxxxx> >Subject: Re: [PATCH] drm/kfd: fix ttm_bo_release warning > >Am 24.09.21 um 08:34 schrieb Yu, Lang: >> [AMD Official Use Only] >> >> >> >>> -----Original Message----- >>> From: Koenig, Christian <Christian.Koenig@xxxxxxx> >>> Sent: Friday, September 24, 2021 1:54 PM >>> To: Yu, Lang <Lang.Yu@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >>> Cc: Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Huang, Ray >>> <Ray.Huang@xxxxxxx> >>> Subject: Re: [PATCH] drm/kfd: fix ttm_bo_release warning >>> >>> >>> Am 24.09.21 um 07:50 schrieb Yu, Lang: >>>> [AMD Official Use Only] >>>>> [SNIP] >>>>>>>> Hi Christian, >>>>>>>> >>>>>>>> Thanks for your explanation and advice. I got your point. >>>>>>>> Actually, these BOs are allocated and pinned during a kfd process >lifecycle. >>>>>>>> I will try to add a flag into struct kgd_mem to indicate which >>>>>>>> BO is pined and should be unpinned. Which will make >>>>>>>> amdgpu_bo_pin/amdgpu_bo_unpin calls balanced. Thanks! >>>>>>> That isn't to much better. The real solution would be to unpin >>>>>>> them when the kfd process is destroyed. >>>>>> Yes, will unpin them when the kfd process is destroyed. >>>>>> But we should indicate which BO is pinned and should be unpinned. Right? >>>>> Well not with a flag or something like that. >>>>> >>>>> The knowledge which BO is pinned and needs to be unpinned should >>>>> come from the control logic and not be papered over by some general >handling. >>>>> That's the background why we have removed the general handling for >>>>> this from TTM in the first place. >>>>> >>>>> In other words when you need to pin a BO because it is kmapped it >>>>> should be unpinned when it is kunmapped and if you don't kunmap at >>>>> all then there is something wrong with the code structure from a >>>>> higher level >>> point of view. >>>> Yes, this function "amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel" did a >>>> kmap, but without a kunmap when the kfd process is destroyed. The >>>> flag >>> actually indicates kmap/kunmap. >>> >>> Well that's the wrong approach then. I mean you need to have the BO >>> reference and the mapped pointer somewhere, don't you? >>> >>> How do you clean those up? >> They are respectively cleaned by " kfd_process_device_free_bos " and " >kfd_process_destroy_pdds". >> Let me put the code here. Thanks! >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h >> index ec028cf963f5..d65b3bf13fd8 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h >> @@ -81,6 +81,7 @@ struct kgd_mem { >> >> bool aql_queue; >> bool is_imported; >> + bool is_mapped_to_kernel; > >Yeah, that is exactly what you absolutely should NOT do. > >> }; >> >> /* KFD Memory Eviction */ >> @@ -278,6 +279,8 @@ int amdgpu_amdkfd_gpuvm_sync_memory( >> struct kgd_dev *kgd, struct kgd_mem *mem, bool intr); >> int amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(struct kgd_dev *kgd, >> struct kgd_mem *mem, void **kptr, uint64_t *size); > >The real question is who is calling this function here? Currently there are 3 places called function "amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel" to kmap a BO. 1, kmap a ptr for pdd->qpd->cwsr_kaddr Call stack: amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel kfd_process_alloc_gpuvm kfd_process_device_init_cwsr_dgpu kfd_process_device_init_vm kfd_ioctl_acquire_vm 2, kmap a ptr for pdd->qpd->ib_kaddr Call stack: amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel kfd_process_alloc_gpuvm kfd_process_device_reserve_ib_mem kfd_process_device_init_vm kfd_ioctl_acquire_vm 3, kmap a ptr for p->signal_page->kernel_address Call stack: amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel kfd_ioctl_create_event The problem is these kmaped BOs were not kunmaped properly when the kfd process is destroyed. Regards, Lang >> +void amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(struct kgd_dev >*kgd, >> + struct kgd_mem *mem); >> int amdgpu_amdkfd_gpuvm_restore_process_bos(void *process_info, >> struct dma_fence **ef); >> int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct kgd_dev *kgd, diff >> --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> index 2d6b2d77b738..45ccbe9f63ee 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> @@ -1857,6 +1857,8 @@ int >> amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(struct kgd_dev *kgd, >> >> amdgpu_bo_unreserve(bo); >> >> + mem->is_mapped_to_kernel = true; >> + >> mutex_unlock(&mem->process_info->lock); >> return 0; >> >> @@ -1870,6 +1872,20 @@ int >amdgpu_amdkfd_gpuvm_map_gtt_bo_to_kernel(struct kgd_dev *kgd, >> return ret; >> } >> >> +void amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(struct kgd_dev >> +*kgd, struct kgd_mem *mem) { >> + struct amdgpu_bo *bo = mem->bo; >> + >> + if (!mem->is_mapped_to_kernel) >> + return; >> + >> + amdgpu_bo_reserve(bo, true); >> + amdgpu_bo_kunmap(bo); >> + amdgpu_bo_unpin(bo); >> + amdgpu_bo_unreserve(bo); >> + mem->is_mapped_to_kernel = false; } >> + >> int amdgpu_amdkfd_gpuvm_get_vm_fault_info(struct kgd_dev *kgd, >> struct kfd_vm_fault_info *mem) >> { >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> index 21ec8a18cad2..f5506b153aed 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c >> @@ -941,6 +941,8 @@ static void kfd_process_device_free_bos(struct >kfd_process_device *pdd) >> peer_pdd->dev->kgd, mem, peer_pdd->drm_priv); >> } >> >> + >> + amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel(pdd->dev->kgd, mem); >> + > >That's a general cleanup function for user space allocations and should not be >abused for stuff like that. > >Regards, >Christian. > >> amdgpu_amdkfd_gpuvm_free_memory_of_gpu(pdd->dev->kgd, mem, >> pdd->drm_priv, NULL); >> kfd_process_device_remove_obj_handle(pdd, id); >> >> Regards, >> Lang >> >>> Regards, >>> Christian.