On 6/7/19 1:44 PM, Jason Gunthorpe wrote:
On Fri, Jun 07, 2019 at 01:21:12PM -0700, Ralph Campbell wrote:
What I want to get to is a pattern like this:
pagefault():
hmm_range_register(&range);
again:
/* On the slow path, if we appear to be live locked then we get
the write side of mmap_sem which will break the live lock,
otherwise this gets the read lock */
if (hmm_range_start_and_lock(&range))
goto err;
lockdep_assert_held(range->mm->mmap_sem);
// Optional: Avoid useless expensive work
if (hmm_range_needs_retry(&range))
goto again;
hmm_range_(touch vmas)
take_lock(driver->update);
if (hmm_range_end(&range) {
release_lock(driver->update);
goto again;
}
// Finish driver updates
release_lock(driver->update);
// Releases mmap_sem
hmm_range_unregister_and_unlock(&range);
What do you think?
Is it clear?
Jason
Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()?
Usually, the fault code has to lock mmap_sem for read in order to
call find_vma() so it can set range.vma.
If HMM drops mmap_sem - which I don't think it should, just return an
error to tell the caller to drop mmap_sem and retry - the find_vma()
will need to be repeated as well.
Overall I don't think it makes a lot of sense to sleep for retry in
hmm_range_start_and_lock() while holding mmap_sem. It would be better
to drop that lock, sleep, then re-acquire it as part of the hmm logic.
The find_vma should be done inside the critical section created by
hmm_range_start_and_lock(), not before it. If we are retrying then we
already slept and the additional CPU cost to repeat the find_vma is
immaterial, IMHO?
Do you see a reason why the find_vma() ever needs to be before the
'again' in my above example? range.vma does not need to be set for
range_register.
Yes, for the GPU case, there can be many faults in an event queue
and the goal is to try to handle more than one page at a time.
The vma is needed to limit the amount of coalescing and checking
for pages that could be speculatively migrated or mapped.
I'm also not sure about acquiring the mmap_sem for write as way to
mitigate thrashing. It seems to me that if a device and a CPU are
both faulting on the same page,
One of the reasons to prefer this approach is that it means we don't
need to keep track of which ranges we are faulting, and if there is a
lot of *unrelated* fault activity (unlikely?) we can resolve it using
mmap sem instead of this elaborate ranges scheme and related
locking.
This would reduce the overall work in the page fault and
invalidate_start/end paths for the common uncontended cases.
some sort of backoff delay is needed to let one side or the other
make some progress.
What the write side of the mmap_sem would do is force the CPU and
device to cleanly take turns. Once the device pages are registered
under the write side the CPU will have to wait in invalidate_start for
the driver to complete a shootdown, then the whole thing starts all
over again.
It is certainly imaginable something could have a 'min life' timer for
a device mapping and hold mm invalidate_start, and device pagefault
for that min time to promote better sharing.
But, if we don't use the mmap_sem then we can livelock and the device
will see an unrecoverable error from the timeout which means we have
risk that under load the system will simply obscurely fail. This seems
unacceptable to me..
Particularly since for the ODP use case the issue is not trashing
migration as a GPU might have, but simple system stability under swap
load. We do not want the ODP pagefault to permanently fail due to
timeout if the VMA is still valid..
Jason
OK, I understand.
If you come up with a set of changes, I can try testing them.