Re: RFC: A KVM-specific alternative to UserfaultFD

On Tue, Nov 7, 2023 at 6:22 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Mon, Nov 06, 2023 at 03:22:05PM -0800, David Matlack wrote:
> > On Mon, Nov 6, 2023 at 3:03 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > On Mon, Nov 06, 2023 at 02:24:13PM -0800, Axel Rasmussen wrote:
> > > > On Mon, Nov 6, 2023 at 12:23 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > > > On Mon, Nov 06, 2023 at 10:25:13AM -0800, David Matlack wrote:
> > > > > >
> > > > > >   * Memory Overhead: UserfaultFD requires an extra 8 bytes per page of
> > > > > >     guest memory for the userspace page table entries.
> > > > >
> > > > > What is this one?
> > > >
> > > > In the way we use userfaultfd, there are two shared userspace mappings
> > > > - one non-UFFD registered one which is used to resolve demand paging
> > > > faults, and another UFFD-registered one which is handed to KVM et al
> > > > for the guest to use. I think David is talking about the "second"
> > > > mapping as overhead here, since with the KVM-based approach he's
> > > > describing we don't need that mapping.
> > >
> > > I see, but then is it userspace relevant?  IMHO we should discuss the
> > > proposal based only on the design itself, rather than relying on any
> > > details on possible userspace implementations if two mappings are not
> > > required but optional.
> >
> > What I mean here is that for UserfaultFD to track accesses at
> > PAGE_SIZE granularity, that requires 1 PTE per page, i.e. 8 bytes per
> > page. Versus the KVM-based approach which only requires 1 bit per page
> > for the present bitmap. This is inherent in the design of UserfaultFD
> > because it uses PTEs to track what is present, not specific to how we
> > use UserfaultFD.
>
> Shouldn't the userspace normally still maintain one virtual mapping anyway
> for the guest address range?  As IIUC kvm still relies a lot on HVA to work
> (at least before guest memfd)? E.g., KVM_SET_USER_MEMORY_REGION, or mmu
> notifiers.  If so, that 8 bytes should be there with/without userfaultfd,
> IIUC.
>
> Also, I think that's not strictly needed for any kind of file memories, as
> in those cases userfaultfd works with the page cache.

This extra ~8 bytes per page of overhead is real, and when using
hugepages it is the theoretical maximum additional overhead that
userfaultfd requires over a KVM-based demand paging alternative.
Consider the case where we are using THPs and have just finished
post-copy, but haven't done any collapsing yet:

For userfaultfd: because we demand-fetched at 4K, we UFFDIO_COPY'd or
UFFDIO_CONTINUE'd at 4K granularity, so the userspace page tables are
entirely shattered. KVM then has no choice but to keep an entirely
shattered second-stage page table as well.

For KVM demand paging: the userspace page tables can remain entirely
populated, so we get PMD mappings here. KVM, though, uses 4K SPTEs
because we have only just finished post-copy and haven't started
collapsing yet.

So both systems end up with a shattered second-stage page table, but
userfaultfd has a shattered userspace page table as well (+8 bytes/4K
if using THP, +another 8 bytes/2M if using HugeTLB-1G, etc.), and that
is where the extra overhead comes from.
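
To put rough numbers on it, here is a back-of-the-envelope sketch
(illustrative arithmetic only, using the 8-bytes-per-PTE and
1-bit-per-page figures from earlier in the thread):

#include <stdio.h>

int main(void)
{
        /*
         * Illustrative only: per-GiB bookkeeping cost once everything
         * has been demand-fetched at 4K and nothing has been collapsed.
         */
        const unsigned long long gib = 1ULL << 30;
        const unsigned long long pages = gib / 4096;    /* 4K base pages */

        /* userfaultfd: one 8-byte last-level PTE per 4K page (shattered) */
        unsigned long long pte_bytes = pages * 8;

        /* KVM demand paging: one bit per 4K page in the present bitmap */
        unsigned long long bitmap_bytes = pages / 8;

        printf("4K pages per GiB:     %llu\n", pages);                  /* 262144 */
        printf("PTE overhead per GiB: %llu KiB\n", pte_bytes >> 10);    /* 2048 (2 MiB) */
        printf("bitmap per GiB:       %llu KiB\n", bitmap_bytes >> 10); /* 32 */
        return 0;
}

i.e. roughly 2 MiB of last-level PTEs per GiB of guest memory for the
shattered userfaultfd mapping, versus ~32 KiB of bitmap per GiB for the
KVM-based approach (ignoring the upper page table levels, which are
comparatively tiny in both cases).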

The second mapping of guest memory that we use today (through which we
install memory) will, given that we are using hugepages, be mapped with
PMDs and PUDs, so its overhead is minimal.

Hope that clears things up!

Thanks,
James
