On Fri, Nov 01, 2019 at 02:44:51PM +0000, Yang, Philip wrote: > @@ -854,12 +853,20 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages) > r = -EPERM; > goto out_unlock; > } > + up_read(&mm->mmap_sem); > + timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); > + > +retry: > + range->notifier_seq = mmu_range_read_begin(&bo->notifier); > > + down_read(&mm->mmap_sem); > r = hmm_range_fault(range, 0); > up_read(&mm->mmap_sem); > - > - if (unlikely(r < 0)) > + if (unlikely(r <= 0)) { > + if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout)) > + goto retry; > goto out_free_pfns; > + } I was reflecting on why this suddently became necessary, and I think what might be happening is that hmm_range_fault() is trigging invalidations as it runs (ie it is faulting in pages or something) and that in turn causes the mrn to need retry. The hmm version of this had a bug where a full invalidate_range_start/end pair would not trigger retry, so this this didn't happen. This is unfortunate as the retry is unnecessary, but at this time I can't think of a good way to separate an ignorable synchronous invalidation caused by hmm_range_fault from an async one that cannot be ignored.. A basic fix would be to not update the mrq seq in the notifier if the invalidate is triggered by hmm_range_fault, but that seems difficult to determine.. Any thoughts Jerome? Jason