On Fri, Jun 07, 2019 at 01:21:12PM -0700, Ralph Campbell wrote: > > What I want to get to is a pattern like this: > > > > pagefault(): > > > > hmm_range_register(&range); > > again: > > /* On the slow path, if we appear to be live locked then we get > > the write side of mmap_sem which will break the live lock, > > otherwise this gets the read lock */ > > if (hmm_range_start_and_lock(&range)) > > goto err; > > > > lockdep_assert_held(range->mm->mmap_sem); > > > > // Optional: Avoid useless expensive work > > if (hmm_range_needs_retry(&range)) > > goto again; > > hmm_range_(touch vmas) > > > > take_lock(driver->update); > > if (hmm_range_end(&range) { > > release_lock(driver->update); > > goto again; > > } > > // Finish driver updates > > release_lock(driver->update); > > > > // Releases mmap_sem > > hmm_range_unregister_and_unlock(&range); > > > > What do you think? > > > > Is it clear? > > > > Jason > > > > Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()? > Usually, the fault code has to lock mmap_sem for read in order to > call find_vma() so it can set range.vma. > If HMM drops mmap_sem - which I don't think it should, just return an > error to tell the caller to drop mmap_sem and retry - the find_vma() > will need to be repeated as well. Overall I don't think it makes a lot of sense to sleep for retry in hmm_range_start_and_lock() while holding mmap_sem. It would be better to drop that lock, sleep, then re-acquire it as part of the hmm logic. The find_vma should be done inside the critical section created by hmm_range_start_and_lock(), not before it. If we are retrying then we already slept and the additional CPU cost to repeat the find_vma is immaterial, IMHO? Do you see a reason why the find_vma() ever needs to be before the 'again' in my above example? range.vma does not need to be set for range_register. > I'm also not sure about acquiring the mmap_sem for write as way to > mitigate thrashing. It seems to me that if a device and a CPU are > both faulting on the same page, One of the reasons to prefer this approach is that it means we don't need to keep track of which ranges we are faulting, and if there is a lot of *unrelated* fault activity (unlikely?) we can resolve it using mmap sem instead of this elaborate ranges scheme and related locking. This would reduce the overall work in the page fault and invalidate_start/end paths for the common uncontended cases. > some sort of backoff delay is needed to let one side or the other > make some progress. What the write side of the mmap_sem would do is force the CPU and device to cleanly take turns. Once the device pages are registered under the write side the CPU will have to wait in invalidate_start for the driver to complete a shootdown, then the whole thing starts all over again. It is certainly imaginable something could have a 'min life' timer for a device mapping and hold mm invalidate_start, and device pagefault for that min time to promote better sharing. But, if we don't use the mmap_sem then we can livelock and the device will see an unrecoverable error from the timeout which means we have risk that under load the system will simply obscurely fail. This seems unacceptable to me.. Particularly since for the ODP use case the issue is not trashing migration as a GPU might have, but simple system stability under swap load. We do not want the ODP pagefault to permanently fail due to timeout if the VMA is still valid.. Jason _______________________________________________ amd-gfx mailing list amd-gfx@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/amd-gfx