Re: 回复: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and swapin

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



The problem is that ttm_bo_type_sg doesn't allocate a page array for the TT object.

Christian.

Am 20.05.21 um 04:58 schrieb Pan, Xinhui:
[AMD Official Use Only]

I am not sure if we can create a ttm_bo_type_sg bo for userptr. But I have another idea now. we can use flag AMDGPU_AMDKFD_CREATE_USERPTR_BO to create the userptr bo.
________________________________________
发件人: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
发送时间: 2021年5月19日 23:11
收件人: Christian König; Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
抄送: Deucher, Alexander; daniel@xxxxxxxx; Koenig, Christian; dri-devel@xxxxxxxxxxxxxxxxxxxxx
主题: Re: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to swapout and swapin

Looks like we're creating the userptr BO as ttm_bo_type_device. I guess
we should be using ttm_bo_type_sg? BTW, amdgpu_gem_userptr_ioctl also
uses ttm_bo_type_device.

Regards,
   Felix


Am 2021-05-19 um 6:01 a.m. schrieb Christian König:
I'm scratching my head how that is even possible.

See when a BO is created in the system domain it is just an empty
hull, e.g. without backing store and allocated pages.

So the swapout function will just ignore it.

Christian.

Am 19.05.21 um 07:07 schrieb Pan, Xinhui:
[AMD Official Use Only]

I have reverted Chris'  patch, still hit this failure.
Just see two lines in Chris' patch. Any BO in cpu domian would be
swapout first. That is why we hit this issue frequently now. But the
bug is there long time ago.

-       for (i = 0; i < TTM_MAX_BO_PRIORITY; ++i) {
-               list_for_each_entry(bo, &glob->swap_lru[i], swap) {
[snip]
+       for (i = TTM_PL_SYSTEM; i < TTM_NUM_MEM_TYPES; ++i) {
+               for (j = 0; j < TTM_MAX_BO_PRIORITY; ++j) {


________________________________________
发件人: Pan, Xinhui <Xinhui.Pan@xxxxxxx>
发送时间: 2021年5月19日 12:09
收件人: Kuehling, Felix; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
抄送: Deucher, Alexander; Koenig, Christian;
dri-devel@xxxxxxxxxxxxxxxxxxxxx; daniel@xxxxxxxx
主题: 回复: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to
swapout and swapin

yes, we really dont swapout SG BOs.
The problems is that before we validate a userptr BO, we create this
BO in CPU domain by default. So this BO has chance to swapout.

we set flag TTM_PAGE_FLAG_SG on userptr BO in popluate() which is too
late.
I have not try to revert Chris' patch as I think it desnt help. Or I
can have a try later.

________________________________________
发件人: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
发送时间: 2021年5月19日 11:29
收件人: Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
抄送: Deucher, Alexander; Koenig, Christian;
dri-devel@xxxxxxxxxxxxxxxxxxxxx; daniel@xxxxxxxx
主题: Re: [RFC PATCH 1/2] drm/amdgpu: Fix memory corruption due to
swapout and swapin

Swapping SG BOs makes no sense, because TTM doesn't own the pages of
this type of BO.

Last I checked, userptr BOs (and other SG BOs) were protected from
swapout by the fact that they would not be added to the swap-LRU. But it
looks like Christian just removed the swap-LRU. I guess this broke that
protection:

commit 2cb51d22d70b18eaf339abf9758bf0b7608da65c
Author: Christian König <christian.koenig@xxxxxxx>
Date:   Tue Oct 6 16:30:09 2020 +0200

       drm/ttm: remove swap LRU v3

       Instead evict round robin from each devices SYSTEM and TT domain.

       v2: reorder num_pages access reported by Dan's script
       v3: fix rebase fallout, num_pages should be 32bit

       Signed-off-by: Christian König <christian.koenig@xxxxxxx>
       Tested-by: Nirmoy Das <nirmoy.das@xxxxxxx>
       Reviewed-by: Huang Rui <ray.huang@xxxxxxx>
       Reviewed-by: Matthew Auld <matthew.auld@xxxxxxxxx>
       Link: https://patchwork.freedesktop.org/patch/424009/

Regards,
     Felix


On 2021-05-18 10:28 p.m., xinhui pan wrote:
cpu 1                                           cpu 2
kfd alloc BO A(userptr)                         alloc BO B(GTT)
       ->init -> validate                               -> init ->
validate -> populate
       init_user_pages                            -> swapout BO A
//hit ttm pages limit
        -> get_user_pages (fill up ttm->pages)
         -> validate -> populate
             -> swapin BO A // Now hit the BUG

We know that get_user_pages may race with swapout on same BO.
Threre are some issues I have met.
1) memory corruption.
This is because we do a swap before memory is setup. ttm_tt_swapout()
just create a swap_storage with its content being 0x0. So when we setup
memory after the swapout. The following swapin makes the memory
corrupted.

2) panic
When swapout happes with get_user_pages, they touch ttm->pages without
anylock. It causes memory corruption too. But I hit page fault mostly.

Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 16
+++++++++++++++-
    1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 928e8d57cd08..42460e4480f8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -835,6 +835,7 @@ static int init_user_pages(struct kgd_mem *mem,
uint64_t user_addr)
        struct amdkfd_process_info *process_info = mem->process_info;
        struct amdgpu_bo *bo = mem->bo;
        struct ttm_operation_ctx ctx = { true, false };
+     struct page **pages;
        int ret = 0;

        mutex_lock(&process_info->lock);
@@ -852,7 +853,13 @@ static int init_user_pages(struct kgd_mem *mem,
uint64_t user_addr)
                goto out;
        }

-     ret = amdgpu_ttm_tt_get_user_pages(bo, bo->tbo.ttm->pages);
+     pages = kvmalloc_array(bo->tbo.ttm->num_pages,
+                     sizeof(struct page *),
+                     GFP_KERNEL | __GFP_ZERO);
+     if (!pages)
+             goto unregister_out;
+
+     ret = amdgpu_ttm_tt_get_user_pages(bo, pages);
        if (ret) {
                pr_err("%s: Failed to get user pages: %d\n",
__func__, ret);
                goto unregister_out;
@@ -863,6 +870,12 @@ static int init_user_pages(struct kgd_mem *mem,
uint64_t user_addr)
                pr_err("%s: Failed to reserve BO\n", __func__);
                goto release_out;
        }
+
+     WARN_ON_ONCE(bo->tbo.ttm->page_flags & TTM_PAGE_FLAG_SWAPPED);
+
+     memcpy(bo->tbo.ttm->pages,
+                     pages,
+                     sizeof(struct page*) * bo->tbo.ttm->num_pages);
        amdgpu_bo_placement_from_domain(bo, mem->domain);
        ret = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
        if (ret)
@@ -872,6 +885,7 @@ static int init_user_pages(struct kgd_mem *mem,
uint64_t user_addr)
    release_out:
        amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
    unregister_out:
+     kvfree(pages);
        if (ret)
                amdgpu_mn_unregister(bo);
    out:
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx




[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux