On Mon, Jul 29, 2024 at 10:17 AM Nikita Kalyazin <kalyazin@xxxxxxxxxx> wrote:
>
> On 26/07/2024 19:00, James Houghton wrote:
> > If it would be useful, we could absolutely have a flag to have all
> > faults go through the asynchronous mechanism. :) It's meant to just be
> > an optimization. For me, it is a necessary optimization.
> >
> > Userfaultfd doesn't scale particularly well: we have to grab two locks
> > to work with the wait_queues. You could create several userfaultfds,
> > but the underlying issue is still there. KVM Userfault, if it uses a
> > wait_queue for the async fault mechanism, will have the same
> > bottleneck. Anish and I worked on making userfaults more scalable for
> > KVM[1], and we ended up with a scheme very similar to what we have in
> > this KVM Userfault series.
> Yes, I see your motivation. Does this approach support async pagefaults
> [1]? Ie would all the guest processes on the vCPU need to stall until a
> fault is resolved or is there a way to let the vCPU run and only block
> the faulted process?

As implemented, it didn't hook into the async page faults stuff. I
think it's technically possible to do that, but we didn't explore it.

> A more general question is, it looks like Userfaultfd's main purpose was
> to support the postcopy use case [2], yet it fails to do that
> efficiently for large VMs. Would it be ideologically better to try to
> improve Userfaultfd's performance (similar to how it was attempted in
> [3]) or is that something you have already looked into and reached a
> dead end as a part of [4]?

My end goal with [4] was to take contention out of the vCPU +
userfault path completely (so, if we are taking a lock exclusively, we
are the only one taking it). I came to the conclusion that the way to
do this that made the most sense was Anish's memory fault exits idea.
I think it's possible to make userfaults scale better themselves, but
it's much more challenging than the memory fault exits approach for
KVM (and I don't have a good way to do it in mind).

> [1] https://lore.kernel.org/lkml/4AEFB823.4040607@xxxxxxxxxx/T/
> [2] https://lwn.net/Articles/636226/
> [3] https://lore.kernel.org/lkml/20230905214235.320571-1-peterx@xxxxxxxxxx/
> [4]
> https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@xxxxxxxxxxxxxx/
>
> > My use case already requires using a reasonably complex API for
> > interacting with a separate userland process for fetching memory, and
> > it's really fast. I've never tried to hook userfaultfd into this other
> > process, but I'm quite certain that [1] + this process's interface
> > scale better than userfaultfd does. Perhaps userfaultfd, for
> > not-so-scaled-up cases, could be *slightly* faster, but I mostly care
> > about what happens when we scale to hundreds of vCPUs.
> >
> > [1]: https://lore.kernel.org/kvm/20240215235405.368539-1-amoorthy@xxxxxxxxxx/
> Do I understand it right that in your setup, when an EPT violation occurs,
> - VMM shares the fault information with the other process via a
>   userspace protocol
> - the process fetches the memory, installs it (?) and notifies VMM
> - VMM calls KVM run to resume execution
> ?

That's right.

> Would you be ok to share an outline of the API you mentioned?

I can share some information. The source (remote) and target (local)
VMMs register guest memory (shared memory) with this network worker
process. On the target during post-copy, the gfn of a fault is
converted into its corresponding local and remote offsets. The API for
then fetching the memory is basically something like
CopyFromRemote(remote_offset, local_offset, length), and the
communication with the process to handle this command is done just
with shared memory.
After memory is copied, the faulting thread does a UFFDIO_CONTINUE
(with MODE_DONTWAKE) to map the page, and then we KVM_RUN to resume.
This will make more sense with the description of UFFDIO_CONTINUE
below. Let me know if you'd like to know more, though I'm not
intimately familiar with all the details of this network worker
process.

> >> How do you envision resolving faults in userspace? Copying the page in
> >> (provided that userspace mapping of guest_memfd is supported [3]) and
> >> clearing the KVM_MEMORY_ATTRIBUTE_USERFAULT alone do not look
> >> sufficient to resolve the fault because an attempt to copy the page
> >> directly in userspace will trigger a fault on its own
> >
> > This is not true for KVM Userfault, at least for right now. Userspace
> > accesses to guest memory will not trigger KVM Userfaults. (I know this
> > name is terrible -- regular old userfaultfd() userfaults will indeed
> > get triggered, provided you've set things up properly.)
> >
> > KVM Userfault is merely meant to catch KVM's own accesses to guest
> > memory (including vCPU accesses). For non-guest_memfd memslots,
> > userspace can totally just write through the VMA it has made (KVM
> > Userfault *cannot*, by virtue of being completely divorced from mm,
> > intercept this access). For guest_memfd, userspace could write to
> > guest memory through a VMA if that's where guest_memfd is headed, but
> > perhaps it will rely on exact details of how userspace is meant to
> > populate guest_memfd memory.
> True, it isn't the case right now. I think I fast-forwarded to a state
> where notifications about VMM-triggered faults to the guest_memfd are
> also sent asynchronously.
>
> > In case it's interesting or useful at all, we actually use
> > UFFDIO_CONTINUE for our live migration use case. We mmap() memory
> > twice -- one of them we register with userfaultfd and also give to
> > KVM. The other one we use to install memory -- our non-faulting view
> > of guest memory!
> That is interesting. You're replacing UFFDIO_COPY (vma1) with a memcpy
> (vma2) + UFFDIO_CONTINUE (vma1), IIUC. Are both mappings created by the
> same process? What benefits does it bring?

The cover letter for the patch series where UFFDIO_CONTINUE was
introduced does a good job of explaining why it's useful for live
migration[5]. But I can summarize it here: when doing pre-copy, we
send many copies of memory to the target. Upon resuming on the target,
we want to get faults on the pages with stale content. It may take a
while to send the final dirty bitmap to the target, and we don't want
to leave the VM paused for that long, so until it arrives we treat
every page as stale. When the dirty bitmap arrives, we want to be able
to quickly (like, without having to copy anything) say "stop getting
faults on these pages, they are in fact clean." Using shared memory
(i.e., having a page cache) with UFFDIO_CONTINUE (well, really
UFFD_FEATURE_MINOR*) allows us to do this.

It also turns out that it is basically necessary if we want our
network API of choice to be able to directly write into guest memory.

[5]: https://lore.kernel.org/linux-mm/20210225002658.2021807-1-axelrasmussen@xxxxxxxxxx/
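
As a rough illustration of the post-copy fault path described above (gfn
to offsets, CopyFromRemote, then UFFDIO_CONTINUE), here is a small Python
model. It is only a sketch under stated assumptions, not the actual VMM or
network worker code: `Region`, `gfn_to_offsets`, `copy_from_remote`,
`resolve_fault`, and `demo` are hypothetical names; the source VM's memory
is simulated with a local bytearray; a temporary file mapped twice stands
in for the shared guest memory; and the UFFDIO_CONTINUE (MODE_DONTWAKE)
step is a callback that only records the range, since the real ioctl needs
a userfaultfd registered for minor faults.

```python
import mmap
import tempfile
from dataclasses import dataclass

PAGE_SIZE = mmap.PAGESIZE


@dataclass
class Region:
    """Hypothetical registration record for one chunk of guest memory."""
    base_gfn: int       # first guest frame number covered by this region
    npages: int         # number of pages in the region
    local_offset: int   # byte offset into the target's shared memory
    remote_offset: int  # byte offset into the source's shared memory


def gfn_to_offsets(regions, gfn):
    """Translate a faulting gfn into (local_offset, remote_offset)."""
    for r in regions:
        if r.base_gfn <= gfn < r.base_gfn + r.npages:
            delta = (gfn - r.base_gfn) * PAGE_SIZE
            return r.local_offset + delta, r.remote_offset + delta
    raise KeyError(f"gfn {gfn:#x} is not registered")


def copy_from_remote(remote_mem, install_view, remote_offset, local_offset, length):
    """Stand-in for CopyFromRemote(remote_offset, local_offset, length).

    Writes through the non-faulting (install) mapping only.
    """
    data = bytes(remote_mem[remote_offset:remote_offset + length])
    install_view[local_offset:local_offset + length] = data


def resolve_fault(regions, remote_mem, install_view, gfn, uffdio_continue):
    """Fetch one page, then ask for it to be mapped into the guest view."""
    local_off, remote_off = gfn_to_offsets(regions, gfn)
    copy_from_remote(remote_mem, install_view, remote_off, local_off, PAGE_SIZE)
    # In a real VMM this would be ioctl(uffd, UFFDIO_CONTINUE, ...) on the
    # guest-facing mapping, followed by KVM_RUN. Here it is just a callback.
    uffdio_continue(local_off, PAGE_SIZE)
    return local_off


def demo():
    npages = 4
    size = npages * PAGE_SIZE
    # Simulated source-VM memory; page 2 holds a recognizable pattern.
    remote_mem = bytearray(size)
    remote_mem[2 * PAGE_SIZE:3 * PAGE_SIZE] = b"\xab" * PAGE_SIZE
    regions = [Region(base_gfn=0x100, npages=npages, local_offset=0, remote_offset=0)]
    continued = []  # ranges a real UFFDIO_CONTINUE would have been asked to map

    with tempfile.TemporaryFile() as f:
        f.truncate(size)
        # Two shared mappings of the same pages: one "given to KVM", one
        # used only for installing contents.
        guest_view = mmap.mmap(f.fileno(), size)
        install_view = mmap.mmap(f.fileno(), size)
        try:
            resolve_fault(regions, remote_mem, install_view, 0x102,
                          lambda off, length: continued.append((off, length)))
            page = bytes(guest_view[2 * PAGE_SIZE:3 * PAGE_SIZE])
        finally:
            guest_view.close()
            install_view.close()
    return page, continued
```

Calling `demo()` resolves a simulated fault on gfn 0x102: the page written
through the install view is visible through the guest view, and the
callback records the range that a real UFFDIO_CONTINUE would map.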