[Public] Hi Christian, My BAD, I checked that discussion history of this just now. So If I read it correctly, the double check at a different place to skip evict is: " drm/ttm: Double check mem_type of BO while eviction"? It is in 5.16 kernel. Regards, Guchun -----Original Message----- From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> Sent: Tuesday, January 11, 2022 7:27 PM To: Chen, Guchun <Guchun.Chen@xxxxxxx>; Pan, Xinhui <Xinhui.Pan@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx Subject: Re: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list IIRC we have completely dropped this patch in favor of a check at a different place. Regards, Christian. Am 11.01.22 um 09:47 schrieb Chen, Guchun: > [Public] > > Hi Christian, > > Looks this patch still missed in 5.16 kernel. Is it intentional? > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgit. > kernel.org%2Fpub%2Fscm%2Flinux%2Fkernel%2Fgit%2Ftorvalds%2Flinux.git%2 > Ftree%2Fdrivers%2Fgpu%2Fdrm%2Fttm%2Fttm_bo.c%3Fh%3Dv5.16&data=04%7 > C01%7CGuchun.Chen%40amd.com%7Cf3b7f4971dc8405b0c2908d9d4f55547%7C3dd89 > 61fe4884e608e11a82d994e183d%7C0%7C0%7C637774972434004088%7CUnknown%7CT > WFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI > 6Mn0%3D%7C3000&sdata=vbuBPHO40J2HGt7abzfzC0nC1DQa62qal5S6TXBRj4w%3 > D&reserved=0 > > Regards, > Guchun > > -----Original Message----- > From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of > Pan, Xinhui > Sent: Tuesday, November 9, 2021 9:16 PM > To: Koenig, Christian <Christian.Koenig@xxxxxxx>; > amd-gfx@xxxxxxxxxxxxxxxxxxxxx > Cc: dri-devel@xxxxxxxxxxxxxxxxxxxxx > Subject: 回复: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru > list > > [AMD Official Use Only] > > [AMD Official Use Only] > > Actually this patch does not totally fix the mismatch of lru list with mem_type as mem_type is changed in ->move() and lru list is changed after that. > > During this small period, another eviction could still happed and evict this mismatched BO from sMam(say, its lru list is on vram domain) to sMem. > ________________________________________ > 发件人: Pan, Xinhui <Xinhui.Pan@xxxxxxx> > 发送时间: 2021年11月9日 21:05 > 收件人: Koenig, Christian; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > 抄送: dri-devel@xxxxxxxxxxxxxxxxxxxxx > 主题: 回复: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list > > Yes, a stable tag is needed. vulkan guys say 5.14 hit this issue too. > > I think that amdgpu_bo_move() does support copy from sysMem to sysMem correctly. > maybe something below is needed. > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > index c83ef42ca702..aa63ae7ddf1e 100644 > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c > @@ -485,7 +485,8 @@ static int amdgpu_bo_move(struct ttm_buffer_object *bo, bool evict, > } > if (old_mem->mem_type == TTM_PL_SYSTEM && > (new_mem->mem_type == TTM_PL_TT || > - new_mem->mem_type == AMDGPU_PL_PREEMPT)) { > + new_mem->mem_type == AMDGPU_PL_PREEMPT || > + new_mem->mem_type == TTM_PL_SYSTEM)) { > ttm_bo_move_null(bo, new_mem); > goto out; > } > > otherwise, amdgpu_move_blit() is called to do the system memory copy which use a wrong address. > 206 /* Map only what can't be accessed directly */ > 207 if (!tmz && mem->start != AMDGPU_BO_INVALID_OFFSET) { > 208 *addr = amdgpu_ttm_domain_start(adev, mem->mem_type) + > 209 mm_cur->start; > 210 return 0; > 211 } > > line 208, *addr is zero. So when amdgpu_copy_buffer submit job with such addr, page fault happens. > > > ________________________________________ > 发件人: Koenig, Christian <Christian.Koenig@xxxxxxx> > 发送时间: 2021年11月9日 20:35 > 收件人: Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx > 抄送: dri-devel@xxxxxxxxxxxxxxxxxxxxx > 主题: Re: 回复: [PATCH] drm/ttm: Put BO in its memory manager's lru list > > Mhm, I'm not sure what the rational behind that is. > > Not moving the BO would make things less efficient, but should never cause a crash. > > Maybe we should add a CC: stable tag and push it to -fixes instead? > > Christian. > > Am 09.11.21 um 13:28 schrieb Pan, Xinhui: >> [AMD Official Use Only] >> >> I hit vulkan cts test hang with navi23. >> >> dmesg says gmc page fault with address 0x0, 0x1000, 0x2000.... >> And some debug log also says amdgu copy one BO from system Domain to system Domain which is really weird. >> ________________________________________ >> 发件人: Koenig, Christian <Christian.Koenig@xxxxxxx> >> 发送时间: 2021年11月9日 20:20 >> 收件人: Pan, Xinhui; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> 抄送: dri-devel@xxxxxxxxxxxxxxxxxxxxx >> 主题: Re: [PATCH] drm/ttm: Put BO in its memory manager's lru list >> >> Am 09.11.21 um 12:19 schrieb xinhui pan: >>> After we move BO to a new memory region, we should put it to the new >>> memory manager's lru list regardless we unlock the resv or not. >>> >>> Signed-off-by: xinhui pan <xinhui.pan@xxxxxxx> >> Interesting find, did you trigger that somehow or did you just >> stumbled over it by reading the code? >> >> Patch is Reviewed-by: Christian König <christian.koenig@xxxxxxx>, I >> will pick that up for drm-misc-next. >> >> Thanks, >> Christian. >> >>> --- >>> drivers/gpu/drm/ttm/ttm_bo.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c >>> b/drivers/gpu/drm/ttm/ttm_bo.c index f1367107925b..e307004f0b28 >>> 100644 >>> --- a/drivers/gpu/drm/ttm/ttm_bo.c >>> +++ b/drivers/gpu/drm/ttm/ttm_bo.c >>> @@ -701,6 +701,8 @@ int ttm_mem_evict_first(struct ttm_device *bdev, >>> ret = ttm_bo_evict(bo, ctx); >>> if (locked) >>> ttm_bo_unreserve(bo); >>> + else >>> + ttm_bo_move_to_lru_tail_unlocked(bo); >>> >>> ttm_bo_put(bo); >>> return ret;