[AMD Official Use Only] I am not sure if we can create a ttm_bo_type_sg bo for userptr. But I have another idea now. we can use flag AMDGPU_AMDKFD_CREATE_USERPTR_BO to create the userptr bo. ________________________________________ 发件人: Kuehling, Felix <Felix.Kuehling@xxxxxxx> 发送时间: 2021年5月19日 23:11 收件人: Christian König; Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx 抄送: Deucher, Alexander; daniel@xxxxxxxx; Koenig, Christian; dri-devel@xxxxxxxxxxxxxxxxxxxxx 主题: Re: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and swapin Looks like we're creating the userptr BO as ttm_bo_type_device. I guess we should be using ttm_bo_type_sg? BTW, amdgpu_gem_userptr_ioctl also uses ttm_bo_type_device. Regards, Felix Am 2021-05-19 um 6:01 a.m. schrieb Christian König: > I'm scratching my head how that is even possible. > > See when a BO is created in the system domain it is just an empty > hull, e.g. without backing store and allocated pages. > > So the swapout function will just ignore it. > > Christian. > > Am 19.05.21 um 07:07 schrieb Pan, Xinhui: >> [AMD Official Use Only] >> >> I have reverted Chris' patch, still hit this failure. >> Just see two lines in Chris' patch. Any BO in cpu domian would be >> swapout first. That is why we hit this issue frequently now. But the >> bug is there long time ago. >> >> - for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) { >> - list_for_each_entry(bo, &glob->swap_lru[i], swap) { >> [snip] >> + for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) { >> + for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) { >> >> >> ________________________________________ >> 发件人: Pan, Xinhui <Xinhui.Pan@xxxxxxx> >> 发送时间: 2021年5月19日 12:09 >> 收件人: Kuehling, Felix; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> 抄送: Deucher, Alexander; Koenig, Christian; >> dri-devel@xxxxxxxxxxxxxxxxxxxxx; daniel@xxxxxxxx >> 主题: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to >> swapout and swapin >> >> yes, we really dont swapout SG BOs. >> The problems is that before we validate a userptr BO, we create this >> BO in CPU domain by default. So this BO has chance to swapout. >> >> we set flag TTM_PAGE_FLAG_SG on userptr BO in popluate() which is too >> late. >> I have not try to revert Chris' patch as I think it desnt help. Or I >> can have a try later. >> >> ________________________________________ >> 发件人: Kuehling, Felix <Felix.Kuehling@xxxxxxx> >> 发送时间: 2021年5月19日 11:29 >> 收件人: Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> 抄送: Deucher, Alexander; Koenig, Christian; >> dri-devel@xxxxxxxxxxxxxxxxxxxxx; daniel@xxxxxxxx >> 主题: Re: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to >> swapout and swapin >> >> Swapping SG BOs makes no sense, because TTM doesn't own the pages of >> this type of BO. >> >> Last I checked, userptr BOs (and other SG BOs) were protected from >> swapout by the fact that they would not be added to the swap-LRU. But it >> looks like Christian just removed the swap-LRU. I guess this broke that >> protection: >> >> commit 2cb51d22d70b18eaf339abf9758bf0b7608da65c >> Author: Christian König <christian.koenig@xxxxxxx> >> Date: Tue Oct 6 16:30:09 2020 +0200 >> >> drm/ttm: remove swap LRU v3 >> >> Instead evict round robin from each devices SYSTEM and TT domain. >> >> v2: reorder num_pages access reported by Dan's script >> v3: fix rebase fallout, num_pages should be 32bit >> >> Signed-off-by: Christian König <christian.koenig@xxxxxxx> >> Tested-by: Nirmoy Das <nirmoy.das@xxxxxxx> >> Reviewed-by: Huang Rui <ray.huang@xxxxxxx> >> Reviewed-by: Matthew Auld <matthew.auld@xxxxxxxxx> >> Link: https://patchwork.freedesktop.org/patch/424009/ >> >> Regards, >> Felix >> >> >> On 2021-05-18 10:28 p.m., xinhui pan wrote: >>> cpu 1 cpu 2 >>> kfd alloc BO A(userptr) alloc BO B(GTT) >>> ->init -> validate -> init -> >>> validate -> populate >>> init_user_pages -> swapout BO A >>> //hit ttm pages limit >>> -> get_user_pages (fill up ttm->pages) >>> -> validate -> populate >>> -> swapin BO A // Now hit the BUG >>> >>> We know that get_user_pages may race with swapout on same BO. >>> Threre are some issues I have met. >>> 1) memory corruption. >>> This is because we do a swap before memory is setup. ttm_tt_swapout() >>> just create a swap_storage with its content being 0x0. So when we setup >>> memory after the swapout. The following swapin makes the memory >>> corrupted. >>> >>> 2) panic >>> When swapout happes with get_user_pages, they touch ttm->pages without >>> anylock. It causes memory corruption too. But I hit page fault mostly. >>> >>> Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx> >>> --- >>> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16 >>> +++++++++++++++- >>> 1 file changed, 15 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> index 928e8d57cd08..42460e4480f8 100644 >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >>> @@ -835,6 +835,7 @@ static int init_user_pages(struct kgd_mem *mem, >>> uint64_t user_addr) >>> struct amdkfd_process_info *process_info = mem->process_info; >>> struct amdgpu_bo *bo = mem->bo; >>> struct ttm_operation_ctx ctx = { true, false }; >>> + struct page **pages; >>> int ret = 0; >>> >>> mutex_lock(&process_info->lock); >>> @@ -852,7 +853,13 @@ static int init_user_pages(struct kgd_mem *mem, >>> uint64_t user_addr) >>> goto out; >>> } >>> >>> - ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages); >>> + pages = kvmalloc_array(bo->tbo.ttm->num_pages, >>> + sizeof(struct page *), >>> + GFP_KERNEL | __GFP_ZERO); >>> + if (!pages) >>> + goto unregister_out; >>> + >>> + ret = amdgpu_ttm_tt_get_user_pages(bo, pages); >>> if (ret) { >>> pr_err("%s: Failed to get user pages: %d\n", >>> __func__, ret); >>> goto unregister_out; >>> @@ -863,6 +870,12 @@ static int init_user_pages(struct kgd_mem *mem, >>> uint64_t user_addr) >>> pr_err("%s: Failed to reserve BO\n", __func__); >>> goto release_out; >>> } >>> + >>> + WARN_ON_ONCE(bo->tbo.ttm->page_flags & TTM_PAGE_FLAG_SWAPPED); >>> + >>> + memcpy(bo->tbo.ttm->pages, >>> + pages, >>> + sizeof(struct page*) * bo->tbo.ttm->num_pages); >>> amdgpu_bo_placement_from_domain(bo, mem->domain); >>> ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); >>> if (ret) >>> @@ -872,6 +885,7 @@ static int init_user_pages(struct kgd_mem *mem, >>> uint64_t user_addr) >>> release_out: >>> amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm); >>> unregister_out: >>> + kvfree(pages); >>> if (ret) >>> amdgpu_mn_unregister(bo); >>> out: >> _______________________________________________ >> amd-gfx mailing list >> amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >