On Tue, Dec 06, 2022, James Houghton wrote: > On Mon, Dec 5, 2022 at 8:06 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > On Mon, Dec 05, 2022, James Houghton wrote: > > > On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > > > > > On Mon, Dec 05, 2022, David Matlack wrote: > > > > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > > ... > > > > > > I'll have a closer read on the nested part, but note that this path already > > > > > > has the mmap lock then it invalidates the goal if we want to avoid taking > > > > > > it from the first place, or maybe we don't care? > > > > > > Not taking the mmap lock would be helpful, but we still have to take > > > it in UFFDIO_CONTINUE, so it's ok if we have to still take it here. > > > > IIUC, Peter is suggesting that the kernel not even get to the point where UFFD > > is involved. The "fault" would get propagated to userspace by KVM, userspace > > fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and > > resumes the vCPU. > > If we haven't UFFDIO_CONTINUE'd some address range yet, > MADV_POPULATE_WRITE for that range will drop into handle_userfault and > go to sleep. Not good! Ah, right, userspace would still need to register UFFD for the region to handle non-KVM (or incompatible KVM) accesses and could loop back on itself. > So, going with the no-slow-GUP approach, resolving faults is done like this: > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart > KVM_RUN. The PTEs will be none/blank right now. This is the common > case. > - If we have UFFDIO_CONTINUE'd already, if we were to do it again, we > would get EEXIST. (In this case, we probably have some type of swap > entry in the page tables.) We have to change the page tables to make > fast GUP succeed now *without* using UFFDIO_CONTINUE now. > MADV_POPULATE_WRITE seems to be the right tool for the job. This case > happens if the kernel has swapped the memory out, is migrating it, has > poisoned it, etc. If MADV_POPULATE_WRITE fails, we probably need to > crash or inject a memory error. > > So with this approach, we never need to take the mmap_lock for reading > in hva_to_pfn, but we still need to take it in UFFDIO_CONTINUE. > Without removing the mmap_lock from *both*, we don't gain much. > > So if we disregard this tiny mmap_lock benefit, the other approach > (the PF_NO_UFFD_WAIT approach) seems better. Can you elaborate on what makes it better? Or maybe generate a list of pros and cons? I can think of (dis)advantages for both approaches, but I haven't identified anything that would be a blocking issue for either approach. Doesn't mean there isn't one or more blocking issues, just that I haven't thought of any :-) > When KVM_RUN exits: > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN. > - If we have, then something bad has happened. Slow GUP already ran > and failed, so we need to treat this in the same way we treat a > MADV_POPULATE_WRITE failure above: userspace might just want to crash > (or inject a memory error or something). > > - James