On Tue, Nov 7, 2023 at 9:24 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Tue, Nov 07, 2023 at 08:11:09AM -0800, James Houghton wrote:
> > This extra ~8 bytes per page overhead is real, and it is the
> > theoretical maximum additional overhead that userfaultfd would require
> > over a KVM-based demand paging alternative when we are using
> > hugepages. Consider the case where we are using THPs and have just
> > finished post-copy, and we haven't done any collapsing yet:
> >
> > For userfaultfd: because we have UFFDIO_COPY'd or UFFDIO_CONTINUE'd at
> > 4K (because we demand-fetched at 4K), the userspace page tables are
> > entirely shattered. KVM has no choice but to have an entirely
> > shattered second-stage page table as well.
> >
> > For KVM demand paging: the userspace page tables can remain entirely
> > populated, so we get PMD mappings here. KVM, though, uses 4K SPTEs
> > because we have only just finished post-copy and haven't started
> > collapsing yet.
> >
> > So both systems end up with a shattered second stage page table, but
> > userfaultfd has a shattered userspace page table as well (+8 bytes/4K
> > if using THP, +another 8 bytes/2M if using HugeTLB-1G, etc.) and that
> > is where the extra overhead comes from.
> >
> > The second mapping of guest memory that we use today (through which we
> > install memory), given that we are using hugepages, will use PMDs and
> > PUDs, so the overhead is minimal.
> >
> > Hope that clears things up!
>
> Ah I see, thanks James. Though, is this a real concern in production use,
> considering worst case 0.2% overhead (all THP backed) and only exist during
> postcopy, only on destination host?

Good question. In an ideal world, 0.2% of lost memory isn't a huge deal,
but it would be nice to save as much memory as possible. So I see this
overhead point as a nice win for a KVM-based solution, but it is not a
key deciding factor in what the right move is.
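(For concreteness, the 0.2% worst case above is just the ratio of a
page-table entry to the memory it maps. A quick sketch of the arithmetic,
assuming x86-64's 8-byte PTEs, 4K base pages, and 2M PMD regions:)

```python
# Sketch of the extra userspace page-table overhead described above.
# Assumptions: x86-64, i.e. 8-byte page-table entries, 4K base pages,
# and 2M regions mapped per PMD entry.
PTE_SIZE = 8                    # bytes per page-table entry
PAGE_4K = 4 * 1024              # base page size
PMD_REGION = 2 * 1024 * 1024    # memory mapped by one PMD entry

# THP backing: a fully shattered userspace page table costs one PTE per
# 4K page -- the worst-case ~0.2% figure.
thp_overhead = PTE_SIZE / PAGE_4K

# HugeTLB-1G backing: additionally one PMD entry per 2M of guest memory.
hugetlb_1g_overhead = PTE_SIZE / PAGE_4K + PTE_SIZE / PMD_REGION

print(f"THP:        {thp_overhead:.4%}")    # ~0.1953%
print(f"HugeTLB-1G: {hugetlb_1g_overhead:.4%}")
```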
(I think the key deciding factor is: what is the best way to make
post-copy work for 1G pages?)

To elaborate a little more: for Google, I don't think the 0.2% loss is a
huge deal by itself (though I am not exactly an authority here). There
are other memory overheads like this that we have to deal with anyway.
The real challenge for us comes from the fact that we already have a
post-copy system that works and has less overhead. If we were to replace
KVM demand paging with userfaultfd, that would mean *regressing* in
efficiency/performance.

That's the main practical challenge: dealing with the regression. We
have to make sure that VMs can still be packed to the appropriate
efficiency, things like that. At this moment, *I think* this is a
solvable problem, but it would be nice to avoid the problem entirely.

But this is Google's problem; I don't think this point should be the
deciding factor here.

- James