[AMD Official Use Only - AMD Internal Distribution Only] Add more info and CC @Kuehling, Felix @cao, lin In amdgpu_amdkfd_fence.c, there is a design description: /* Eviction Fence * Fence helper functions to deal with KFD memory eviction. * Big Idea - Since KFD submissions are done by user queues, a BO cannot be * evicted unless all the user queues for that process are evicted. * * All the BOs in a process share an eviction fence. When process X wants * to map VRAM memory but TTM can't find enough space, TTM will attempt to * evict BOs from its LRU list. TTM checks if the BO is valuable to evict * by calling ttm_device_funcs->eviction_valuable(). * * ttm_device_funcs->eviction_valuable() - will return false if the BO belongs * to process X. Otherwise, it will return true to indicate BO can be * evicted by TTM. * * If ttm_device_funcs->eviction_valuable returns true, then TTM will continue * the evcition process for that BO by calling ttm_bo_evict --> amdgpu_bo_move * --> amdgpu_copy_buffer(). This sets up job in GPU scheduler. * * GPU Scheduler (amd_sched_main) - sets up a cb (fence_add_callback) to * nofity when the BO is free to move. fence_add_callback --> enable_signaling * --> amdgpu_amdkfd_fence.enable_signaling * * amdgpu_amdkfd_fence.enable_signaling - Start a work item that will quiesce * user queues and signal fence. The work item will also start another delayed * work item to restore BOs */ If mark BOs as invalidated before submitting job to move the buffer, user queue is still active. During the time before user queue is evicted, if a drm job achieve, amdgpu_cs_vm_handling will call amdgpu_vm_handle_moved to clear the ptes of Invalidated BOs. Then page fault happens because compute shader is still accessing the "invalidated" BO. I am not familiar with amdgpu_vm_bo state machine, so I don’t know if it is an code error or an design error. Thanks River -----Original Message----- From: YuanShang Mao (River) Sent: Friday, July 12, 2024 10:55 AM To: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> Cc: Huang, Trigger <Trigger.Huang@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Subject: RE: [PATCH] drm/amdgpu: Mark amdgpu_bo as invalid after moved We need to make sure that all BOs of an active kfd process validated. Moving buffer will trigger process eviction. If mark it as invalided before process eviction, related kfd process is still active and may attempt to access this invalidated BO. Agree with Trigger. Seems kfd eviction should been synced to move notify, not the move action. Thanks River -----Original Message----- From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx> Sent: Thursday, July 11, 2024 8:39 PM To: Huang, Trigger <Trigger.Huang@xxxxxxx>; YuanShang Mao (River) <YuanShang.Mao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx Subject: Re: [PATCH] drm/amdgpu: Mark amdgpu_bo as invalid after moved Yeah, completely agree. This patch doesn't really make sense. Please explain why you would want to do this? Regards, Christian. Am 11.07.24 um 13:56 schrieb Huang, Trigger: > [AMD Official Use Only - AMD Internal Distribution Only] > > This patch seems to be wrong. > Quite a lot of preparations have been done in amdgpu_bo_move_notify > For example, amdgpu_bo_kunmap() will be called to prevent the BO from being accessed by CPU. If not called, the CPU may attempt to access the BO while it is being moved. > > Thanks, > Trigger > >> -----Original Message----- >> From: amd-gfx <amd-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf Of >> YuanShang >> Sent: Thursday, July 11, 2024 5:10 PM >> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx >> Cc: YuanShang Mao (River) <YuanShang.Mao@xxxxxxx>; YuanShang Mao >> (River) <YuanShang.Mao@xxxxxxx> >> Subject: [PATCH] drm/amdgpu: Mark amdgpu_bo as invalid after moved >> >> Caution: This message originated from an External Source. Use proper >> caution when opening attachments, clicking links, or responding. >> >> >> It leads to race condition if amdgpu_bo is marked as invalid before >> it is really moved. >> >> Signed-off-by: YuanShang <YuanShang.Mao@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 10 +++++----- >> 1 file changed, 5 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >> index 29e4b5875872..a29d5132ad3d 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c >> @@ -519,8 +519,8 @@ static int amdgpu_bo_move(struct >> ttm_buffer_object *bo, bool evict, >> >> if (!old_mem || (old_mem->mem_type == TTM_PL_SYSTEM && >> bo->ttm == NULL)) { >> - amdgpu_bo_move_notify(bo, evict, new_mem); >> ttm_bo_move_null(bo, new_mem); >> + amdgpu_bo_move_notify(bo, evict, new_mem); >> return 0; >> } >> if (old_mem->mem_type == AMDGPU_GEM_DOMAIN_DGMA || @@ - >> 530,8 +530,8 @@ static int amdgpu_bo_move(struct ttm_buffer_object >> *bo, bool evict, >> if (old_mem->mem_type == TTM_PL_SYSTEM && >> (new_mem->mem_type == TTM_PL_TT || >> new_mem->mem_type == AMDGPU_PL_PREEMPT)) { >> - amdgpu_bo_move_notify(bo, evict, new_mem); >> ttm_bo_move_null(bo, new_mem); >> + amdgpu_bo_move_notify(bo, evict, new_mem); >> return 0; >> } >> if ((old_mem->mem_type == TTM_PL_TT || @@ -542,9 +542,9 @@ >> static int amdgpu_bo_move(struct ttm_buffer_object *bo, bool evict, >> return r; >> >> amdgpu_ttm_backend_unbind(bo->bdev, bo->ttm); >> - amdgpu_bo_move_notify(bo, evict, new_mem); >> ttm_resource_free(bo, &bo->resource); >> ttm_bo_assign_mem(bo, new_mem); >> + amdgpu_bo_move_notify(bo, evict, new_mem); >> return 0; >> } >> >> @@ -557,8 +557,8 @@ static int amdgpu_bo_move(struct >> ttm_buffer_object *bo, bool evict, >> new_mem->mem_type == AMDGPU_PL_OA || >> new_mem->mem_type == AMDGPU_PL_DOORBELL) { >> /* Nothing to save here */ >> - amdgpu_bo_move_notify(bo, evict, new_mem); >> ttm_bo_move_null(bo, new_mem); >> + amdgpu_bo_move_notify(bo, evict, new_mem); >> return 0; >> } >> >> @@ -583,11 +583,11 @@ static int amdgpu_bo_move(struct >> ttm_buffer_object *bo, bool evict, >> return -EMULTIHOP; >> } >> >> - amdgpu_bo_move_notify(bo, evict, new_mem); >> if (adev->mman.buffer_funcs_enabled) >> r = amdgpu_move_blit(bo, evict, new_mem, old_mem); >> else >> r = -ENODEV; >> + amdgpu_bo_move_notify(bo, evict, new_mem); >> >> if (r) { >> /* Check that all memory is CPU accessible */ >> -- >> 2.25.1