Hi Qu,
Am 03.04.21 um 07:08 schrieb Qu Huang:
Hi Christian,
On 2021/4/3 0:25, Christian König wrote:
Hi Qu,
Am 02.04.21 um 05:18 schrieb Qu Huang:
Before dma_resv_lock(bo->base.resv, NULL) in
amdgpu_bo_release_notify(),
the bo->base.resv lock may be held by ttm_mem_evict_first(),
That can't happen since when bo_release_notify is called the BO has not
more references and is therefore deleted.
And we never evict a deleted BO, we just wait for it to become idle.
Yes, the bo reference counter return to zero will enter
ttm_bo_release(),but notify bo release (call amdgpu_bo_release_notify())
first happen, and then test if a reservation object's fences have been
signaled, and then mark bo as deleted and remove bo from the LRU list.
When ttm_bo_release() and ttm_mem_evict_first() is concurrent,
the Bo has not been removed from the LRU list and is not marked as
deleted, this will happen.
Not sure on which code base you are, but I don't see how this can happen.
ttm_mem_evict_first() calls ttm_bo_get_unless_zero() and
ttm_bo_release() is only called when the BO reference count becomes zero.
So ttm_mem_evict_first() will see that this BO is about to be destroyed
and skips it.
As a test, when we use CPU memset instead of SDMA fill in
amdgpu_bo_release_notify(), the result is page fault:
PID: 5490 TASK: ffff8e8136e04100 CPU: 4 COMMAND: "gemmPerf"
#0 [ffff8e79eaa17970] machine_kexec at ffffffffb2863784
#1 [ffff8e79eaa179d0] __crash_kexec at ffffffffb291ce92
#2 [ffff8e79eaa17aa0] crash_kexec at ffffffffb291cf80
#3 [ffff8e79eaa17ab8] oops_end at ffffffffb2f6c768
#4 [ffff8e79eaa17ae0] no_context at ffffffffb2f5aaa6
#5 [ffff8e79eaa17b30] __bad_area_nosemaphore at ffffffffb2f5ab3d
#6 [ffff8e79eaa17b80] bad_area_nosemaphore at ffffffffb2f5acae
#7 [ffff8e79eaa17b90] __do_page_fault at ffffffffb2f6f6c0
#8 [ffff8e79eaa17c00] do_page_fault at ffffffffb2f6f925
#9 [ffff8e79eaa17c30] page_fault at ffffffffb2f6b758
[exception RIP: memset+31]
RIP: ffffffffb2b8668f RSP: ffff8e79eaa17ce8 RFLAGS: 00010a17
RAX: bebebebebebebebe RBX: ffff8e747bff10c0 RCX: 0000060b00200000
RDX: 0000000000000000 RSI: 00000000000000be RDI: ffffab807f000000
RBP: ffff8e79eaa17d10 R8: ffff8e79eaa14000 R9: ffffab7c80000000
R10: 000000000000bcba R11: 00000000000001ba R12: ffff8e79ebaa4050
R13: ffffab7c80000000 R14: 0000000000022600 R15: ffff8e8136e04100
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff8e79eaa17ce8] amdgpu_bo_release_notify at ffffffffc092f2d1
[amdgpu]
#11 [ffff8e79eaa17d18] ttm_bo_release at ffffffffc08f39dd [amdttm]
#12 [ffff8e79eaa17d58] amdttm_bo_put at ffffffffc08f3c8c [amdttm]
#13 [ffff8e79eaa17d68] amdttm_bo_vm_close at ffffffffc08f7ac9 [amdttm]
#14 [ffff8e79eaa17d80] remove_vma at ffffffffb29ef115
#15 [ffff8e79eaa17da0] exit_mmap at ffffffffb29f2c64
#16 [ffff8e79eaa17e58] mmput at ffffffffb28940c7
#17 [ffff8e79eaa17e78] do_exit at ffffffffb289dc95
#18 [ffff8e79eaa17f10] do_group_exit at ffffffffb289e4cf
#19 [ffff8e79eaa17f40] sys_exit_group at ffffffffb289e544
#20 [ffff8e79eaa17f50] system_call_fastpath at ffffffffb2f74ddb
Well that might be perfectly expected. VRAM is not necessarily CPU
accessible.
Regards,
Christian.
Regards,
Qu.
Regards,
Christian.
and the VRAM mem will be evicted, mem region was replaced
by Gtt mem region. amdgpu_bo_release_notify() will then
hold the bo->base.resv lock, and SDMA will get an invalid
address in amdgpu_fill_buffer(), resulting in a VMFAULT
or memory corruption.
To avoid it, we have to hold bo->base.resv lock first, and
check whether the mem.mem_type is TTM_PL_VRAM.
Signed-off-by: Qu Huang <jinsdb@xxxxxxx>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 4b29b82..8018574 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1300,12 +1300,16 @@ void amdgpu_bo_release_notify(struct
ttm_buffer_object *bo)
if (bo->base.resv == &bo->base._resv)
amdgpu_amdkfd_remove_fence_on_pt_pd_bos(abo);
- if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node ||
- !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
+ if (!(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
return;
dma_resv_lock(bo->base.resv, NULL);
+ if (bo->mem.mem_type != TTM_PL_VRAM || !bo->mem.mm_node) {
+ dma_resv_unlock(bo->base.resv);
+ return;
+ }
+
r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv,
&fence);
if (!WARN_ON(r)) {
amdgpu_bo_fence(abo, fence, false);
--
1.8.3.1