On Tue, Dec 6, 2022 at 1:01 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Tue, Dec 06, 2022, James Houghton wrote: > > On Mon, Dec 5, 2022 at 8:06 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > > > On Mon, Dec 05, 2022, James Houghton wrote: > > > > On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > > > > > > > On Mon, Dec 05, 2022, David Matlack wrote: > > > > > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > > > ... > > > > > > > I'll have a closer read on the nested part, but note that this path already > > > > > > > has the mmap lock then it invalidates the goal if we want to avoid taking > > > > > > > it from the first place, or maybe we don't care? > > > > > > > > Not taking the mmap lock would be helpful, but we still have to take > > > > it in UFFDIO_CONTINUE, so it's ok if we have to still take it here. > > > > > > IIUC, Peter is suggesting that the kernel not even get to the point where UFFD > > > is involved. The "fault" would get propagated to userspace by KVM, userspace > > > fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and > > > resumes the vCPU. > > > > If we haven't UFFDIO_CONTINUE'd some address range yet, > > MADV_POPULATE_WRITE for that range will drop into handle_userfault and > > go to sleep. Not good! > > Ah, right, userspace would still need to register UFFD for the region to handle > non-KVM (or incompatible KVM) accesses and could loop back on itself. > > > So, going with the no-slow-GUP approach, resolving faults is done like this: > > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart > > KVM_RUN. The PTEs will be none/blank right now. This is the common > > case. > > - If we have UFFDIO_CONTINUE'd already, if we were to do it again, we > > would get EEXIST. (In this case, we probably have some type of swap > > entry in the page tables.) We have to change the page tables to make > > fast GUP succeed now *without* using UFFDIO_CONTINUE now. > > MADV_POPULATE_WRITE seems to be the right tool for the job. This case > > happens if the kernel has swapped the memory out, is migrating it, has > > poisoned it, etc. If MADV_POPULATE_WRITE fails, we probably need to > > crash or inject a memory error. > > > > So with this approach, we never need to take the mmap_lock for reading > > in hva_to_pfn, but we still need to take it in UFFDIO_CONTINUE. > > Without removing the mmap_lock from *both*, we don't gain much. > > > > So if we disregard this tiny mmap_lock benefit, the other approach > > (the PF_NO_UFFD_WAIT approach) seems better. > > Can you elaborate on what makes it better? Or maybe generate a list of pros and > cons? I can think of (dis)advantages for both approaches, but I haven't identified > anything that would be a blocking issue for either approach. Doesn't mean there > isn't one or more blocking issues, just that I haven't thought of any :-) Let's see.... so using no-slow-GUP over no UFFD waiting: - No need to take mmap_lock in mem fault path. - Change the relevant __gfn_to_pfn_memslot callers (kvm_faultin_pfn/user_mem_abort/others?) to set `atomic = true` if the new CAP is used. - No need for a new PF_NO_UFFD_WAIT (would be toggled somewhere in/near kvm_faultin_pfn/user_mem_abort). - Userspace has to indirectly figure out the state of the page tables to know what action to take (which introduces some weirdness, like if anyone MADV_DONTNEEDs some guest memory, we need to know). - While userfaultfd is registered (so like during post-copy), any hva_to_pfn() calls that were resolvable with slow GUP before (without dropping into handle_userfault()) will now need to be resolved by userspace manually with a call to MADV_POPULATE_WRITE. This extra trip to userspace could slow things down. Both of these seem pretty simple to implement in the kernel; the most complicated part is just returning KVM_EXIT_MEMORY_FAULT in more places / for other architectures (I care about x86 and arm64). Right now both approaches seem fine to me. Not having to take the mmap_lock in the fault path, while being such a minor difference now, could be a huge benefit if we can later get around to making UFFDIO_CONTINUE not need the mmap lock. Disregarding that, not requiring userspace to guess the state of the page tables seems helpful (less bug-prone, I guess). > > > When KVM_RUN exits: > > - If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN. > > - If we have, then something bad has happened. Slow GUP already ran > > and failed, so we need to treat this in the same way we treat a > > MADV_POPULATE_WRITE failure above: userspace might just want to crash > > (or inject a memory error or something). > > > > - James