[AMD Official Use Only] yes, we really dont swapout SG BOs. The problems is that before we validate a userptr BO, we create this BO in CPU domain by default. So this BO has chance to swapout. we set flag TTM_PAGE_FLAG_SG on userptr BO in popluate() which is too late. I have not try to revert Chris' patch as I think it desnt help. Or I can have a try later. ________________________________________ 发件人: Kuehling, Felix <Felix.Kuehling@xxxxxxx> 发送时间: 2021年5月19日 11:29 收件人: Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx 抄送: Deucher, Alexander; Koenig, Christian; dri-devel@xxxxxxxxxxxxxxxxxxxxx; daniel@xxxxxxxx 主题: Re: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and swapin Swapping SG BOs makes no sense, because TTM doesn't own the pages of this type of BO. Last I checked, userptr BOs (and other SG BOs) were protected from swapout by the fact that they would not be added to the swap-LRU. But it looks like Christian just removed the swap-LRU. I guess this broke that protection: commit 2cb51d22d70b18eaf339abf9758bf0b7608da65c Author: Christian König <christian.koenig@xxxxxxx> Date: Tue Oct 6 16:30:09 2020 +0200 drm/ttm: remove swap LRU v3 Instead evict round robin from each devices SYSTEM and TT domain. v2: reorder num_pages access reported by Dan's script v3: fix rebase fallout, num_pages should be 32bit Signed-off-by: Christian König <christian.koenig@xxxxxxx> Tested-by: Nirmoy Das <nirmoy.das@xxxxxxx> Reviewed-by: Huang Rui <ray.huang@xxxxxxx> Reviewed-by: Matthew Auld <matthew.auld@xxxxxxxxx> Link: https://patchwork.freedesktop.org/patch/424009/ Regards, Felix On 2021-05-18 10:28 p.m., xinhui pan wrote: > cpu 1 cpu 2 > kfd alloc BO A(userptr) alloc BO B(GTT) > ->init -> validate -> init -> validate -> populate > init_user_pages -> swapout BO A //hit ttm pages limit > -> get_user_pages (fill up ttm->pages) > -> validate -> populate > -> swapin BO A // Now hit the BUG > > We know that get_user_pages may race with swapout on same BO. > Threre are some issues I have met. > 1) memory corruption. > This is because we do a swap before memory is setup. ttm_tt_swapout() > just create a swap_storage with its content being 0x0. So when we setup > memory after the swapout. The following swapin makes the memory > corrupted. > > 2) panic > When swapout happes with get_user_pages, they touch ttm->pages without > anylock. It causes memory corruption too. But I hit page fault mostly. > > Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx> > --- > drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16 +++++++++++++++- > 1 file changed, 15 insertions(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > index 928e8d57cd08..42460e4480f8 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c > @@ -835,6 +835,7 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr) > struct amdkfd_process_info *process_info = mem->process_info; > struct amdgpu_bo *bo = mem->bo; > struct ttm_operation_ctx ctx = { true, false }; > + struct page **pages; > int ret = 0; > > mutex_lock(&process_info->lock); > @@ -852,7 +853,13 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr) > goto out; > } > > - ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages); > + pages = kvmalloc_array(bo->tbo.ttm->num_pages, > + sizeof(struct page *), > + GFP_KERNEL | __GFP_ZERO); > + if (!pages) > + goto unregister_out; > + > + ret = amdgpu_ttm_tt_get_user_pages(bo, pages); > if (ret) { > pr_err("%s: Failed to get user pages: %d\n", __func__, ret); > goto unregister_out; > @@ -863,6 +870,12 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr) > pr_err("%s: Failed to reserve BO\n", __func__); > goto release_out; > } > + > + WARN_ON_ONCE(bo->tbo.ttm->page_flags & TTM_PAGE_FLAG_SWAPPED); > + > + memcpy(bo->tbo.ttm->pages, > + pages, > + sizeof(struct page*) * bo->tbo.ttm->num_pages); > amdgpu_bo_placement_from_domain(bo, mem->domain); > ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); > if (ret) > @@ -872,6 +885,7 @@ static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr) > release_out: > amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm); > unregister_out: > + kvfree(pages); > if (ret) > amdgpu_mn_unregister(bo); > out: