On Tue, Feb 14, 2023 at 5:16 PM Anish Moorthy <amoorthy@xxxxxxxxxx> wrote:
>
> This series improves scalability with userfaultfd-based postcopy live
> migration. It implements the no-slow-gup approach which James Houghton
> described in his earlier RFC ([1]). A new cap,
> KVM_CAP_MEM_FAULT_NOWAIT, is introduced, which causes KVM to exit to
> userspace if fast get_user_pages (GUP) fails while resolving a page
> fault. The motivation is to allow (most) EPT violations to be resolved
> without going through userfaultfd, which involves serializing faults
> on internal locks: see [1] for more details.

To clarify a little bit here:

One big question: why do we need a new KVM CAP? Couldn't we just use
UFFD_FEATURE_SIGBUS?

The original RFC thread [1] addresses this question, but to reiterate
here: the difference comes down to non-vCPU guest memory accesses, like
when KVM needs to read guest memory to emulate an instruction. If we
use UFFD_FEATURE_SIGBUS, KVM's copy_{to,from}_user() will just fail,
and the VM will probably die (depending on exactly what KVM was trying
to do). In these cases, we want KVM to sleep in handle_userfault().
Given that we couldn't just use UFFD_FEATURE_SIGBUS, a new KVM CAP
seemed to be the most natural solution. (A rough, untested sketch of
the UFFD_FEATURE_SIGBUS setup is at the bottom of this mail.)

> After receiving the new exit, userspace can check whether it has
> previously UFFDIO_COPY/CONTINUEd the faulting address; if not, then
> it knows that fast GUP could not possibly have succeeded, and so the
> fault has to be resolved via UFFDIO_COPY/CONTINUE. In these cases a
> UFFDIO_WAKE is unnecessary, as the vCPU thread hasn't been put to
> sleep waiting on the uffd.
>
> If userspace *has* already COPY/CONTINUEd the address, then it must
> take some other action to make fast GUP succeed, such as swapping in
> the page (for instance, via MADV_POPULATE_WRITE for writable
> mappings).
>
> This feature should only be enabled during userfaultfd postcopy, as
> it prevents the generation of async page faults.
>
> The actual kernel changes needed on arm64/x86 are small: most of this
> series is actually just adding support for the new feature in the
> demand paging self test. Performance samples (rates reported in
> thousands of pages/s, averaged over five runs each), generated using
> [2] on an x86 machine with 256 cores, are shown below.
>
> vCPUs | Paging rate (w/o new cap) | Paging rate (w/ new cap)
>     1 |                       150 |                      340
>     2 |                       191 |                      477
>     4 |                       210 |                      809
>     8 |                       155 |                     1239
>    16 |                       130 |                     1595
>    32 |                       108 |                     2299
>    64 |                        86 |                     3482
>   128 |                        62 |                     4134
>   256 |                        36 |                     4012

Thank you, Anish! :)

(For anyone following along, a second untested sketch at the bottom
shows how a VMM might handle the new exit.)

> [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@xxxxxxxxxxxxxx/
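
For reference, here is the UFFD_FEATURE_SIGBUS setup mentioned above,
as a minimal, untested sketch. Nothing in it is specific to Anish's
series; it's just the standard userfaultfd API handshake:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/userfaultfd.h>

    int main(void)
    {
            int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
            struct uffdio_api api = {
                    .api = UFFD_API,
                    .features = UFFD_FEATURE_SIGBUS,
            };

            if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) < 0) {
                    perror("userfaultfd setup");
                    return 1;
            }

            /*
             * With UFFD_FEATURE_SIGBUS, a fault on a registered but
             * missing page raises SIGBUS in the faulting task instead
             * of blocking in handle_userfault(). For KVM, that means a
             * non-vCPU access via copy_{to,from}_user() simply fails,
             * which is exactly the behavior we can't tolerate.
             */
            return 0;
    }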
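
And here is a sketch of the vCPU-thread flow described in the cover
letter. To be clear about what's made up: handle_mem_fault_exit() and
the helpers page_installed(), install_page(), and gpa_to_hva() are
hypothetical VMM-internal names for illustration only, not anything
from the series or from KVM's ABI; see the actual patches for how the
new exit is reported. MADV_POPULATE_WRITE needs a 5.14+ kernel.

    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Hypothetical VMM-internal helpers, assumed for this sketch. */
    extern bool page_installed(uint64_t gpa); /* prior UFFDIO_COPY/CONTINUE? */
    extern void install_page(uint64_t gpa);   /* issue UFFDIO_COPY/CONTINUE */
    extern void *gpa_to_hva(uint64_t gpa);
    extern long page_size;

    /* Called when KVM_RUN returns with the new memory-fault exit. */
    static void handle_mem_fault_exit(uint64_t gpa)
    {
            if (!page_installed(gpa)) {
                    /*
                     * Never COPY/CONTINUEd, so fast GUP could not
                     * possibly have succeeded: fetch the page and
                     * install it. No UFFDIO_WAKE is needed, since the
                     * vCPU thread was never put to sleep on the uffd.
                     */
                    install_page(gpa);
            } else {
                    /*
                     * Already installed, but fast GUP still failed
                     * (e.g. the page was swapped out). Fault it back
                     * in; MADV_POPULATE_WRITE works for writable
                     * mappings.
                     */
                    madvise(gpa_to_hva(gpa), page_size,
                            MADV_POPULATE_WRITE);
            }
            /* Re-enter the guest; the vCPU retries the faulting access. */
    }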