[AMD Official Use Only - AMD Internal Distribution Only] >-----Original Message----- >From: Lazar, Lijo <Lijo.Lazar@xxxxxxx> >Sent: Thursday, March 6, 2025 1:12 PM >To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Subject: Re: [PATCH v3] drm/amdgpu: Fix the race condition for draining retry fault > > > >On 3/6/2025 7:23 AM, Emily Deng wrote: >> Issue: >> In the scenario where svm_range_restore_pages is called, but >> svm->checkpoint_ts has not been set and the retry fault has not been >> drained, svm_range_unmap_from_cpu is triggered and calls >> svm_range_free. Meanwhile, svm_range_restore_pages continues execution >> and reaches svm_range_from_addr. This results in a "failed to find prange..." error, >causing the page recovery to fail. >> >> How to fix: >> Move the timestamp check code under the protection of svm->lock. >> >> v2: >> Make sure all right locks are released before go out. >> >> v3: >> Directly goto out_unlock_svms, and return -EAGAIN. >> >> >> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 34 >> ++++++++++++++++------------ >> 1 file changed, 19 insertions(+), 15 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> index d04725583f19..42ee49d19ee9 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> @@ -2917,10 +2917,13 @@ static bool svm_range_skip_recover(struct >> svm_range *prange) >> >> static void >> svm_range_count_fault(struct kfd_node *node, struct kfd_process *p, >> - int32_t gpuidx) >> + int32_t gpuidx, int r) >> { >> struct kfd_process_device *pdd; >> >> + if (r == -EAGAIN) >> + return; > >As a generic code comment - the parameter passed to a function should have some >meaning and serve some purpose. I don't think 'r' has any such. You may try to >generate a /doc describing the params passed to this function and then probably will >realize the same. > >Thanks, >Lijo Will remove the params, and refine the change. Thanks. Emily Deng Best Wishes > >> + >> /* fault is on different page of same range >> * or fault is skipped to recover later >> * or fault is on invalid virtual address @@ -3008,19 +3011,6 @@ >> svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, >> goto out; >> } >> >> - /* check if this page fault time stamp is before svms->checkpoint_ts */ >> - if (svms->checkpoint_ts[gpuidx] != 0) { >> - if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) { >> - pr_debug("draining retry fault, drop fault 0x%llx\n", addr); >> - r = 0; >> - goto out; >> - } else >> - /* ts is after svms->checkpoint_ts now, reset svms- >>checkpoint_ts >> - * to zero to avoid following ts wrap around give wrong >comparing >> - */ >> - svms->checkpoint_ts[gpuidx] = 0; >> - } >> - >> if (!p->xnack_enabled) { >> pr_debug("XNACK not enabled for pasid 0x%x\n", pasid); >> r = -EFAULT; >> @@ -3040,6 +3030,20 @@ svm_range_restore_pages(struct amdgpu_device >*adev, unsigned int pasid, >> mmap_read_lock(mm); >> retry_write_locked: >> mutex_lock(&svms->lock); >> + >> + /* check if this page fault time stamp is before svms->checkpoint_ts */ >> + if (svms->checkpoint_ts[gpuidx] != 0) { >> + if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) { >> + pr_debug("draining retry fault, drop fault 0x%llx\n", addr); >> + r = -EAGAIN; >> + goto out_unlock_svms; >> + } else >> + /* ts is after svms->checkpoint_ts now, reset svms- >>checkpoint_ts >> + * to zero to avoid following ts wrap around give wrong >comparing >> + */ >> + svms->checkpoint_ts[gpuidx] = 0; >> + } >> + >> prange = svm_range_from_addr(svms, addr, NULL); >> if (!prange) { >> pr_debug("failed to find prange svms 0x%p address [0x%llx]\n", @@ >> -3165,7 +3169,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, >unsigned int pasid, >> mutex_unlock(&svms->lock); >> mmap_read_unlock(mm); >> >> - svm_range_count_fault(node, p, gpuidx); >> + svm_range_count_fault(node, p, gpuidx, r); >> >> mmput(mm); >> out: