[AMD Official Use Only - AMD Internal Distribution Only]

>-----Original Message-----
>From: Yang, Philip <Philip.Yang@xxxxxxx>
>Sent: Tuesday, March 4, 2025 11:00 PM
>To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>Subject: Re: [PATCH] drm/amdgpu: Fix missing drain retry fault the last entry
>
>
>On 2025-03-03 19:44, Deng, Emily wrote:
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> Ping......
>>
>> Emily Deng
>> Best Wishes
>>
>>
>>
>>> -----Original Message-----
>>> From: Emily Deng <Emily.Deng@xxxxxxx>
>>> Sent: Monday, March 3, 2025 5:35 PM
>>> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>> Cc: Deng, Emily <Emily.Deng@xxxxxxx>
>>> Subject: [PATCH] drm/amdgpu: Fix missing drain retry fault the last entry
>>>
>>> For equal case, it also need to be dropped.
>>>
>>> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
>>> index 7d4395a5d8ac..73b8bcb54734 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
>>> @@ -76,7 +76,7 @@ struct amdgpu_ih_ring {
>>>
>>> /* return true if time stamp t2 is after t1 with 48bit wrap around */
>>> #define amdgpu_ih_ts_after(t1, t2) \
>>> -  (((int64_t)((t2) << 16) - (int64_t)((t1) << 16)) > 0LL)
>>> +  (((int64_t)((t2) << 16) - (int64_t)((t1) << 16)) >= 0LL)
>
>The comment is correct and the current condition is correct too:
>svm_range_drain_retry_fault drops the retry fault with the same timestamp as
>the IH checkpoint_wptr timestamp. Do you see a GPU page fault with a stale
>retry fault after process exit, or what issue do you want to fix?
>
>Regards,
>
>Philip

This commit aims to address the page fault issue observed when running kfdtest with the following command:

sudo bash -c "export GTEST_REPEAT=1000; export GTEST_BREAK_ON_FAILURE=1; ./kfdtest --gtest_filter=KFDSVMRangeTest.*:-KFDSVMRangeTest.ReadOnlyRangeTest* --timeout 100000" 2>/dev/null

The issue is severe because the page fault triggers a call to kfd_set_dbg_ev_from_interrupt, which subsequently invokes kfd_dqm_evict_pasid. As a result, the process is evicted, leading to test failures. This commit resolves the root cause of the page fault to prevent such evictions.

Emily Deng
Best Wishes

>
>
>>>
>>> /* provided by the ih block */
>>> struct amdgpu_ih_funcs {
>>> --
>>> 2.34.1
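
For reference, the effect of changing > to >= can be checked in isolation. The following is only a user-space sketch of the two comparison variants, not the kernel code: the names ts_delta48, ts_after_strict, ts_after_or_equal and ts_selfcheck are made up for illustration, and the subtraction is done on unsigned values before the cast (an assumption made here to avoid signed overflow in standalone C; the macro in amdgpu_ih.h casts each operand first).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Signed distance from t1 to t2 for 48-bit timestamps.
 * Shifting left by 16 discards bits 48..63 and moves timestamp bit 47
 * into the sign position, so a 48-bit wrap looks like a small positive step.
 * Hypothetical helper, not a kernel function.
 */
static inline int64_t ts_delta48(uint64_t t1, uint64_t t2)
{
	return (int64_t)((t2 << 16) - (t1 << 16));
}

/* Condition before the patch: t2 strictly after t1. */
static inline bool ts_after_strict(uint64_t t1, uint64_t t2)
{
	return ts_delta48(t1, t2) > 0;
}

/* Condition after the patch: t2 after t1, or equal to it. */
static inline bool ts_after_or_equal(uint64_t t1, uint64_t t2)
{
	return ts_delta48(t1, t2) >= 0;
}

/* Small self-check covering the ordinary, equal, and wrap-around cases. */
static inline void ts_selfcheck(void)
{
	const uint64_t max48 = 0xFFFFFFFFFFFFULL; /* largest 48-bit timestamp */

	/* Ordinary ordering: both variants agree. */
	assert(ts_after_strict(100, 200) && ts_after_or_equal(100, 200));
	assert(!ts_after_strict(200, 100) && !ts_after_or_equal(200, 100));

	/* Equal timestamps: the only case where the variants differ. */
	assert(!ts_after_strict(100, 100));
	assert(ts_after_or_equal(100, 100));

	/* Wrap-around: 5 comes after max48 once the counter wraps. */
	assert(ts_after_strict(max48, 5) && ts_after_or_equal(max48, 5));
}
```

The two variants differ only when both timestamps are identical, which is the "equal case" named in the commit message. How that case should be treated by the callers is the open question in this thread; the sketch does not model them.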