My reply to Peter earlier bounced from the mailing list due to the
attached images (sorry!). I've copied it below to get a record on-list.
Just for completeness, the message ID of the bounced mail was
<CAF7b7mo68VLNp=QynfT7QKgdq=d1YYGv1SEVEDxF9UwHzF6YDw@xxxxxxxxxxxxxx>

On Wed, Apr 19, 2023 at 2:53 PM Anish Moorthy <amoorthy@xxxxxxxxxx> wrote:
>
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> > > We considered sharding into several UFFDs. I do think it helps, but
> > > also I think there are two main problems with it...
> >
> > But I agree I can never justify that it'll always work. If you or Anish
> > could provide some data points to further support this issue that would
> > be very interesting and helpful, IMHO, not required though.
>
> Axel covered the reasons for not pursuing the sharding approach nicely
> (thanks!). It's not something we ever prototyped, so I don't have any
> further numbers there.
>
> On Wed, Apr 19, 2023 at 2:05 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 19, 2023 at 01:15:44PM -0700, Axel Rasmussen wrote:
> > >
> > > I think we could share numbers from some of our internal benchmarks,
> > > or at the very least give relative numbers (e.g. +50% increase), but
> > > since a lot of the software stack is proprietary (e.g. we don't use
> > > QEMU), it may not be that useful or reproducible for folks.
> >
> > Those numbers can still be helpful. I was not asking for
> > reproducibility, but some test to better justify this feature.
>
> I do have some internal benchmarking numbers on this front, although
> it's been a while since I collected them, so the details might be a
> little sparse.
>
> I've confirmed performance gains with "scalable userfaultfd" using two
> workloads besides the self-test:
>
> The first, cycler, spins up a VM and launches a binary which (a) maps
> a large amount of memory and then (b) loops over it issuing writes as
> fast as possible. It's not a very realistic guest, but it at least
> involves an actual migrating VM, and we often use it to
> stress/performance-test migration changes. The write rate which cycler
> achieves during userfaultfd-based postcopy (without scalable uffd
> enabled) is about 25% of what it achieves under KVM Demand Paging (the
> internal KVM feature GCE currently uses for postcopy). With
> userfaultfd-based postcopy and scalable uffd enabled, that rate jumps
> nearly 3x, to about 75% of what KVM Demand Paging achieves. The
> attached "Cycler.png" illustrates this effect (though due to some
> other details, faster demand paging actually makes the migrations
> worse: the point is that scalable uffd performs more similarly to KVM
> Demand Paging :)
>
> The second is the redis memtier benchmark [1], a more realistic
> workload in which we migrate a VM running the redis server. With
> scalable userfaultfd, the client VM observes significantly higher
> transaction rates during uffd-based postcopy (see "Memtier.png"). I
> can pull the exact numbers if needed, but just from eyeballing the
> graph you can see that the improvement is something like 5-10x (at
> least) for several seconds. There's still a noticeable gap with KVM
> Demand Paging-based postcopy, but the improvement is definitely
> significant.
>
> [1] https://github.com/RedisLabs/memtier_benchmark
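
For anyone curious, the guest-side loop in cycler is conceptually just
something like the sketch below. This is a rough illustration only (the
actual binary is internal, and details like the mapping size and write
stride here are made up), but it captures the access pattern described
above: map a large region, then loop over it issuing writes as fast as
possible.

#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
        /* Size and stride are arbitrary, chosen only for illustration. */
        const size_t len = 64UL << 30;  /* "large": 64 GiB anonymous region */
        const size_t stride = 4096;     /* one write per 4 KiB page */

        uint8_t *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED)
                return EXIT_FAILURE;

        /* Sweep the region forever, issuing writes as fast as possible. */
        for (uint8_t val = 0; ; val++)
                for (size_t off = 0; off < len; off += stride)
                        mem[off] = val;
}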