Hi, Anish,

[Copied Nadav Amit for the last few paragraphs on userfaultfd, because
 Nadav worked on a few userfaultfd performance problems before, so maybe
 he'll also have some ideas here.]

On Wed, Apr 19, 2023 at 02:53:46PM -0700, Anish Moorthy wrote:
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> > > We considered sharding into several UFFDs. I do think it helps, but
> > > also I think there are two main problems with it...
> >
> > But I agree I can never justify that it'll always work.  If you or Anish
> > could provide some data points to further support this issue that would
> > be very interesting and helpful, IMHO, not required though.
>
> Axel covered the reasons for not pursuing the sharding approach nicely
> (thanks!). It's not something we ever prototyped, so I don't have any
> further numbers there.
>
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> >
> > > I think we could share numbers from some of our internal benchmarks,
> > > or at the very least give relative numbers (e.g. +50% increase), but
> > > since a lot of the software stack is proprietary (e.g. we don't use
> > > QEMU), it may not be that useful or reproducible for folks.
> >
> > Those numbers can still be helpful.  I was not asking for
> > reproducibility, but for some test to better justify this feature.
>
> I do have some internal benchmarking numbers on this front, although
> it's been a while since I've collected them so the details might be a
> little sparse.

Thanks for sharing these data points.  I don't understand most of them
yet, but I think they're better than the unit test numbers provided.

> I've confirmed performance gains with "scalable userfaultfd" using two
> workloads besides the self-test:
>
> The first, cycler, spins up a VM and launches a binary which (a) maps
> a large amount of memory and then (b) loops over it issuing writes as
> fast as possible. It's not a very realistic guest but it at least
> involves an actual migrating VM, and we often use it to
> stress/performance test migration changes. The write rate which cycler
> achieves during userfaultfd-based postcopy (without scalable uffd
> enabled) is about 25% of what it achieves under KVM Demand Paging (the
> internal KVM feature GCE currently uses for postcopy). With
> userfaultfd-based postcopy and scalable uffd enabled that rate jumps
> nearly 3x, so about 75% of what KVM Demand Paging achieves. The
> attached "Cycler.png" illustrates this effect (though due to some
> other details, faster demand paging actually makes the migrations
> worse: the point is that scalable uffd performs more similarly to kvm
> demand paging :)

Yes, I don't understand why vanilla uffd is so different, and I'm not
sure what the graph means either, though. :)

Is the first drop caused by starting migration/precopy?

Is the 2nd (huge) drop (mostly to zero) caused by frequent accesses to
new pages during postcopy?

Does the workload do its busy writes from a single thread, or from NCPU
threads?

Can the 25% vs. 75% comparison you mentioned be seen on the graph?  Or
maybe that's within the period where all three lines are very close to 0?

> The second is the redis memtier benchmark [1], a more realistic
> workflow where we migrate a VM running the redis server. With scalable
> userfaultfd, the client VM observes significantly higher transaction
> rates during uffd-based postcopy (see "Memtier.png"). I can pull the
> exact numbers if needed, but just from eyeballing the graph you can
> see that the improvement is something like 5-10x (at least) for
> several seconds. There's still a noticeable gap with KVM demand paging
> based postcopy, but the improvement is definitely significant.
>
> [1] https://github.com/RedisLabs/memtier_benchmark

Does the "5-10x" difference lie in the "15s valley" you pointed out in
the graph?

Is it reproducible that the blue line always has a totally different
"valley" compared to the yellow/red ones?

Personally I'd still really like to know what happens if we just split
the vma and see how it goes with a standard workload, but maybe I'm
asking for too much, so don't worry about it yet.

The solution proposed here still makes sense to me, and I agree that if
it can be done well it can resolve the bottleneck of the single
userfaultfd.  But after reading some of the patches I'm not sure whether
it can be implemented in a complete way: you mentioned here and there
that things can be missing, probably because guest pages are accessed
from random places all over KVM.  Relying solely on -EFAULT doesn't look
very reliable to me so far, but that could be because I don't yet really
understand how it works.  Is the above a concern for the current
solution?

Have any of you tried to investigate the other approach, namely scaling
userfaultfd itself?  One thing userfaultfd does well is that the
trapping happens in a unified place (when the page fault happens), so it
doesn't need to worry about the random spots all over the KVM module
that read/write guest pages.  The question is whether it'll be easy to
do.

Splitting the vma is definitely still a way to scale userfaultfd, but
probably not a good enough one, because it scales along the memory axis,
not across cores: if tens of cores access a small region that falls into
the same VMA, it stops working.  However, maybe it can be scaled in some
other form?

So far my understanding is that read()ing messages from the uffd is
still not the problem - the read can be done in chunks, and each message
will be converted into a request to be sent later.  If the real problem
lies in a bunch of threads queuing, is it possible that we could just
provide more queues for the events?  The readers would then simply go
over all the queues.  How to decide "which thread uses which queue" can
be another problem; what quickly comes to mind is a "hash(tid) %
n_queues", but maybe it can be done better.  Each vcpu thread will have
a different tid, so they can hopefully scale across the queues.
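To make that slightly more concrete, here is a purely illustrative
sketch of what I have in mind.  Nothing like this exists in the kernel
today: UFFD_NR_QUEUES, struct uffd_fault_queue and uffd_pick_queue() are
all made-up names, and feature negotiation / error handling are ignored.

#include <linux/hash.h>      /* hash_32() */
#include <linux/log2.h>      /* ilog2() */
#include <linux/sched.h>     /* current */
#include <linux/spinlock.h>
#include <linux/wait.h>

/*
 * Hypothetical: the userfaultfd context carries N independent fault
 * queues instead of the single fault_pending_wqh/fault_wqh pair it has
 * today.
 */
#define UFFD_NR_QUEUES  16      /* made-up knob, could be a feature flag */

struct uffd_fault_queue {
        spinlock_t              lock;
        wait_queue_head_t       fault_pending_wqh;      /* faults not yet read */
        wait_queue_head_t       fault_wqh;              /* faults being resolved */
};

struct userfaultfd_ctx {
        /* ... existing fields ... */
        struct uffd_fault_queue queues[UFFD_NR_QUEUES];
};

/*
 * The "hash(tid) % n_queues" part: each faulting (vcpu) thread picks a
 * queue based on its tid, so different vcpus mostly contend on
 * different locks.
 */
static struct uffd_fault_queue *uffd_pick_queue(struct userfaultfd_ctx *ctx)
{
        return &ctx->queues[hash_32(current->pid, ilog2(UFFD_NR_QUEUES))];
}

The idea would then be for handle_userfault() to queue the fault on
uffd_pick_queue(ctx) rather than on the single ctx->fault_pending_wqh,
while the read side walks all the queues when draining messages (or
userspace could even dedicate one reader thread per queue).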
There's at least one issue that I know of with such an idea: once we
have >1 uffd queues, the message order becomes uncertain.  It may matter
for some uffd users (e.g. cooperative userfaultfd, see
UFFD_FEATURE_FORK|REMOVE|etc.) because I believe message ordering
matters for them (mostly CRIU).  But I don't think that's a blocker
either, because we could simply forbid those features when multiple
queues are enabled.

That's a wild idea that I'm just thinking out loud about, and I have
totally no idea whether it'll work or not.  It's more or less a generic
question of "whether there's a chance to scale on the uffd side, in case
that turns out to be a cleaner approach", assuming the concern above is
a real one.

Thanks,

-- 
Peter Xu