Hi, Anish,

[Copied Nadav Amit for the last few paragraphs on userfaultfd, because
 Nadav worked on a few userfaultfd performance problems before, so maybe
 he'll also have some ideas here.]

On Wed, Apr 19, 2023 at 02:53:46PM -0700, Anish Moorthy wrote:
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> > > We considered sharding into several UFFDs. I do think it helps, but
> > > also I think there are two main problems with it...
> >
> > But I agree I can never justify that it'll always work.  If you or Anish
> > could provide some data points to further support this issue that would
> > be very interesting and helpful, IMHO, not required though.
>
> Axel covered the reasons for not pursuing the sharding approach nicely
> (thanks!). It's not something we ever prototyped, so I don't have any
> further numbers there.
>
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> >
> > > I think we could share numbers from some of our internal benchmarks,
> > > or at the very least give relative numbers (e.g. +50% increase), but
> > > since a lot of the software stack is proprietary (e.g. we don't use
> > > QEMU), it may not be that useful or reproducible for folks.
> >
> > Those numbers can still be helpful.  I was not asking for
> > reproducibility, but for some test to better justify this feature.
>
> I do have some internal benchmarking numbers on this front, although
> it's been a while since I've collected them so the details might be a
> little sparse.

Thanks for sharing these data points.  I don't understand most of them
yet, but I think they're better than the unit test numbers provided.

> I've confirmed performance gains with "scalable userfaultfd" using two
> workloads besides the self-test:
>
> The first, cycler, spins up a VM and launches a binary which (a) maps
> a large amount of memory and then (b) loops over it issuing writes as
> fast as possible. It's not a very realistic guest but it at least
> involves an actual migrating VM, and we often use it to
> stress/performance test migration changes. The write rate which cycler
> achieves during userfaultfd-based postcopy (without scalable uffd
> enabled) is about 25% of what it achieves under KVM Demand Paging (the
> internal KVM feature GCE currently uses for postcopy). With
> userfaultfd-based postcopy and scalable uffd enabled that rate jumps
> nearly 3x, so about 75% of what KVM Demand Paging achieves. The
> attached "Cycler.png" illustrates this effect (though due to some
> other details, faster demand paging actually makes the migrations
> worse: the point is that scalable uffd performs more similarly to kvm
> demand paging :)

Yes, I don't understand why vanilla uffd is so different, and I'm not
sure what the graph means either, though. :)

Is the first drop caused by starting migration/precopy?

Is the 2nd (huge) drop (mostly to zero) caused by frequent accesses to
new pages during postcopy?

Does the workload do its busy writes from a single thread, or from NCPU
threads?

Can the 25% vs. 75% comparison you mentioned be seen on the graph?  Or
maybe that's within the period where all three lines are very close to 0?

> The second is the redis memtier benchmark [1], a more realistic
> workflow where we migrate a VM running the redis server. With scalable
> userfaultfd, the client VM observes significantly higher transaction
> rates during uffd-based postcopy (see "Memtier.png"). I can pull the
> exact numbers if needed, but just from eyeballing the graph you can
> see that the improvement is something like 5-10x (at least) for
> several seconds. There's still a noticeable gap with KVM demand paging
> based postcopy, but the improvement is definitely significant.
>
> [1] https://github.com/RedisLabs/memtier_benchmark

Does the "5-10x" difference lie in the "15s valley" you pointed out in
the graph?

Is it reproducible that the blue line always has a totally different
"valley" compared to the yellow/red ones?

Personally I'd still really like to know what happens if we just split
the vma and see how it goes with a standard workload, but maybe I'm
asking for too much, so don't worry about it yet.

The solution proposed here still makes sense to me, and I agree that if
it can be done well it can resolve the bottleneck of the single
userfaultfd.  But after reading some of the patches I'm not sure whether
it can be implemented in a complete way: you mentioned here and there
that things can be missing, probably because guest pages are accessed
from random places all over KVM.  Relying solely on -EFAULT doesn't look
very reliable to me so far, but that could be because I don't yet really
understand how it works.  Is the above a concern for the current
solution?

Have any of you tried to investigate the other approach, namely scaling
userfaultfd itself?  One thing userfaultfd does well is that the
trapping happens in a unified place (when the page fault happens), so it
doesn't need to worry about the random spots all over the KVM module
that read/write guest pages.  The question is whether it'll be easy to
do.

Splitting the vma is definitely still a way to scale userfaultfd, but
probably not a good enough one, because it scales along the memory axis,
not across cores: if tens of cores access a small region that falls into
the same VMA, it stops working.  However, maybe it can be scaled in some
other form?

So far my understanding is that read()ing messages from the uffd is
still not the problem - the read can be done in chunks, and each message
will be converted into a request to be sent later.  If the real problem
lies in a bunch of threads queuing, is it possible that we could just
provide more queues for the events?  The readers would then simply go
over all the queues.  How to decide "which thread uses which queue" can
be another problem; what quickly comes to mind is a "hash(tid) %
n_queues", but maybe it can be done better.  Each vcpu thread will have
a different tid, so they can hopefully scale across the queues.
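To make that slightly more concrete, here is a purely illustrative
sketch of what I have in mind.  Nothing like this exists in the kernel
today: UFFD_NR_QUEUES, struct uffd_fault_queue and uffd_pick_queue() are
all made-up names, and feature negotiation / error handling are ignored.

#include <linux/hash.h>      /* hash_32() */
#include <linux/log2.h>      /* ilog2() */
#include <linux/sched.h>     /* current */
#include <linux/spinlock.h>
#include <linux/wait.h>

/*
 * Hypothetical: the userfaultfd context carries N independent fault
 * queues instead of the single fault_pending_wqh/fault_wqh pair it has
 * today.
 */
#define UFFD_NR_QUEUES  16      /* made-up knob, could be a feature flag */

struct uffd_fault_queue {
        spinlock_t              lock;
        wait_queue_head_t       fault_pending_wqh;      /* faults not yet read */
        wait_queue_head_t       fault_wqh;              /* faults being resolved */
};

struct userfaultfd_ctx {
        /* ... existing fields ... */
        struct uffd_fault_queue queues[UFFD_NR_QUEUES];
};

/*
 * The "hash(tid) % n_queues" part: each faulting (vcpu) thread picks a
 * queue based on its tid, so different vcpus mostly contend on
 * different locks.
 */
static struct uffd_fault_queue *uffd_pick_queue(struct userfaultfd_ctx *ctx)
{
        return &ctx->queues[hash_32(current->pid, ilog2(UFFD_NR_QUEUES))];
}

The idea would then be for handle_userfault() to queue the fault on
uffd_pick_queue(ctx) rather than on the single ctx->fault_pending_wqh,
while the read side walks all the queues when draining messages (or
userspace could even dedicate one reader thread per queue).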
There's at least one issue that I know of with such an idea: once we
have >1 uffd queues, the message order becomes uncertain.  It may matter
for some uffd users (e.g. cooperative userfaultfd, see
UFFD_FEATURE_FORK|REMOVE|etc.) because I believe message ordering
matters for them (mostly CRIU).  But I don't think that's a blocker
either, because we could simply forbid those features when multiple
queues are enabled.

That's a wild idea that I'm just thinking out loud about, and I have
totally no idea whether it'll work or not.  It's more or less a generic
question of "whether there's a chance to scale on the uffd side, in case
that turns out to be a cleaner approach", assuming the concern above is
a real one.

Thanks,

-- 
Peter Xu