Am 2021-11-01 um 10:40 p.m. schrieb Alex Sierra: > [Why]: > When we call hmm_range_fault to map memory after a migration, we don't > expect memory to be migrated again as a result of hmm_range_fault. The > driver ensures that all memory is in GPU-accessible locations so that > no migration should be needed. However, there is one corner case where > hmm_range_fault can unexpectedly cause a migration from DEVICE_PRIVATE > back to system memory due to a write-fault when a system memory page in > the same range was mapped read-only (e.g. COW). Ranges with individual > pages in different locations are usually the result of failed page > migrations (e.g. page lock contention). The unexpected migration back > to system memory causes a deadlock from recursive locking in our > driver. > > [How]: > Creating a task reference new member under svm_range_list_init struct. The _init is not part of the struct name. With that fixed, the patch is Reviewed-by: Felix Kuehling <Felix.Kuehling@xxxxxxx> > Setting this with "current" reference, right before the hmm_range_fault > is called. This member is checked against "current" reference at > svm_migrate_to_ram callback function. If equal, the migration will be > ignored. > > Signed-off-by: Alex Sierra <alex.sierra@xxxxxxx> > --- > drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 4 ++++ > drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 1 + > drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2 ++ > 3 files changed, 7 insertions(+) > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c > index bff40e8bca67..eb19f44ec86d 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c > @@ -936,6 +936,10 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf) > pr_debug("failed find process at fault address 0x%lx\n", addr); > return VM_FAULT_SIGBUS; > } > + if (READ_ONCE(p->svms.faulting_task) == current) { > + pr_debug("skipping ram migration\n"); > + return 0; > + } > addr >>= PAGE_SHIFT; > pr_debug("CPU page fault svms 0x%p address 0x%lx\n", &p->svms, addr); > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > index f88666bdf57c..7b41a58b1ade 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h > @@ -858,6 +858,7 @@ struct svm_range_list { > atomic_t evicted_ranges; > struct delayed_work restore_work; > DECLARE_BITMAP(bitmap_supported, MAX_GPU_INSTANCE); > + struct task_struct *faulting_task; > }; > > /* Process data */ > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > index 939c863315ba..4031c2a67af4 100644 > --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c > @@ -1492,9 +1492,11 @@ static int svm_range_validate_and_map(struct mm_struct *mm, > > next = min(vma->vm_end, end); > npages = (next - addr) >> PAGE_SHIFT; > + WRITE_ONCE(p->svms.faulting_task, current); > r = amdgpu_hmm_range_get_pages(&prange->notifier, mm, NULL, > addr, npages, &hmm_range, > readonly, true, owner); > + WRITE_ONCE(p->svms.faulting_task, NULL); > if (r) { > pr_debug("failed %d to get svm range pages\n", r); > goto unreserve_out;