[AMD Official Use Only - AMD Internal Distribution Only] >-----Original Message----- >From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> >Sent: Thursday, March 6, 2025 8:53 AM >To: Deng, Emily <Emily.Deng@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx >Subject: Re: [PATCH] drm/amdgpu: Fix the race condition for draining retry fault > > >On 2025-03-05 19:49, Emily Deng wrote: >> Issue: >> In the scenario where svm_range_restore_pages is called, but svm- >>checkpoint_ts >> has not been set and the retry fault has not been drained, >> svm_range_unmap_from_cpu is triggered and calls svm_range_free. >> Meanwhile, svm_range_restore_pages continues execution and reaches >> svm_range_from_addr. This results in a "failed to find prange..." error, causing the >page recovery to fail. >> >> How to fix: >> Move the timestamp check code under the protection of svm->lock. >> >> Signed-off-by: Emily Deng <Emily.Deng@xxxxxxx> >> --- >> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 ++++++++++++++------------- >> 1 file changed, 14 insertions(+), 13 deletions(-) >> >> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> index d04725583f19..c99c10e247ad 100644 >> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> @@ -3008,19 +3008,6 @@ svm_range_restore_pages(struct amdgpu_device >*adev, unsigned int pasid, >> goto out; >> } >> >> - /* check if this page fault time stamp is before svms->checkpoint_ts */ >> - if (svms->checkpoint_ts[gpuidx] != 0) { >> - if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) { >> - pr_debug("draining retry fault, drop fault 0x%llx\n", addr); >> - r = 0; >> - goto out; >> - } else >> - /* ts is after svms->checkpoint_ts now, reset svms- >>checkpoint_ts >> - * to zero to avoid following ts wrap around give wrong >comparing >> - */ >> - svms->checkpoint_ts[gpuidx] = 0; >> - } >> - >> if (!p->xnack_enabled) { >> pr_debug("XNACK not enabled for pasid 0x%x\n", pasid); >> r = -EFAULT; >> @@ -3040,6 +3027,20 @@ svm_range_restore_pages(struct amdgpu_device >*adev, unsigned int pasid, >> mmap_read_lock(mm); >> retry_write_locked: >> mutex_lock(&svms->lock); >> + >> + /* check if this page fault time stamp is before svms->checkpoint_ts */ >> + if (svms->checkpoint_ts[gpuidx] != 0) { >> + if (amdgpu_ih_ts_after_or_equal(ts, svms->checkpoint_ts[gpuidx])) { >> + pr_debug("draining retry fault, drop fault 0x%llx\n", addr); >> + r = 0; >> + goto out; > >I believe you need to goto out_unlock_svms here to make sure all the right locks are >released. > >Regards, > Felix Thanks, will modify later. Emily Deng Best Wishes > >> + } else >> + /* ts is after svms->checkpoint_ts now, reset svms- >>checkpoint_ts >> + * to zero to avoid following ts wrap around give wrong >comparing >> + */ >> + svms->checkpoint_ts[gpuidx] = 0; >> + } >> + >> prange = svm_range_from_addr(svms, addr, NULL); >> if (!prange) { >> pr_debug("failed to find prange svms 0x%p address [0x%llx]\n",