On Thu, Apr 20, 2023 at 2:29 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> Yes I don't understand why vanilla uffd is so different, neither am I sure
> what does the graph mean, though. :)
>
> Is the first drop caused by starting migration/precopy?
>
> Is the 2nd (huge) drop (mostly to zero) caused by frequently accessing new
> pages during postcopy?

Right on both counts. By the way, for anyone who notices that the
userfaultfd (red/yellow) lines never recover to the initial level of
performance, whereas the blue line does: that's a separate issue, please
ignore :)

> Is the workload busy writes single thread, or NCPU threads?

One thread per vCPU.

> Is what you mentioned on the 25%-75% comparison can be shown on the graph?
> Or maybe that's part of the period where all three are very close to 0?

Yes, unfortunately the absolute size of the improvement is still pretty
small (we go from ~50 writes/s to ~150), so it all looks like zero at this
scale.

> > The second is the redis memtier benchmark [1], a more realistic
> > workflow where we migrate a VM running the redis server. With scalable
> > userfaultfd, the client VM observes significantly higher transaction
> > rates during uffd-based postcopy (see "Memtier.png"). I can pull the
> > exact numbers if needed, but just from eyeballing the graph you can
> > see that the improvement is something like 5-10x (at least) for
> > several seconds. There's still a noticeable gap with KVM demand paging
> > based-postcopy, but the improvement is definitely significant.
> > [1] https://github.com/RedisLabs/memtier_benchmark
>
> Does the "5-10x" difference rely in the "15s valley" you pointed out in the
> graph?

Not quite sure what you mean: I meant to point out that the ~15s valley is
where we observe the improvement due to scalable userfaultfd. For most of
that valley, the speedup of scalable uffd is 5-10x (or thereabouts; I admit
to eyeballing those numbers :)

> Is it reproduceable that the blue line always has a totally different
> "valley" comparing to yellow/red?

Yes, but the offset of that valley is just precopy taking longer for some
reason in that configuration. Honestly it's probably best to ignore the
blue line entirely, since that's a Google-specific stack.

> Personally I still really want to know what happens if we just split the
> vma and see how it goes with a standard workloads, but maybe I'm asking too
> much so don't yet worry. The solution here proposed still makes sense to
> me and I agree if this can be done well it can resolve the bottleneck over
> 1-userfaultfd.
>
> But after I read some of the patches I'm not sure whether it's possible it
> can be implemented in a complete way. You mentioned here and there on that
> things can be missing probably due to random places accessing guest pages
> all over kvm. Relying sololy on -EFAULT so far doesn't look very reliable
> to me, but it could be because I didn't yet really understand how it works.
>
> Is above a concern to the current solution?

Based on your comment in [1], I think your impression of this series is
that it tries to (a) catch all of the cases where userfaultfd would be
triggered and (b) bypass userfaultfd by surfacing the page faults via vCPU
exit. That's only happening in two places (the KVM_ABSENT_MAPPING_FAULT
changes), corresponding to the EPT violation handler on x86 and its arm64
equivalent. Bypassing the queuing of faults onto a uffd in those two cases,
and instead delivering them via vCPU exit, is what provides the performance
gains I'm demonstrating.
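To make that concrete, here's a rough sketch of what a vCPU thread's
userspace loop ends up looking like with this change. It's purely
illustrative: uffd, gpa_to_hva(), fetch_from_source() and
fault_gpa_from_run() are placeholders, not the real code or ABI from this
series; the point is just to show the shape of the flow.

/*
 * Illustrative sketch only -- not the actual code or ABI from this series.
 * When KVM_RUN comes back to userspace because the vCPU touched a
 * not-yet-fetched page, the vCPU thread fetches and installs the page
 * itself, then simply re-enters the guest.
 */
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/userfaultfd.h>

#define PAGE_SIZE 4096

/* Placeholders for state/helpers a real VMM already has. */
extern int uffd;                                         /* uffd covering guest RAM */
extern void *gpa_to_hva(uint64_t gpa);                   /* address translation */
extern void fetch_from_source(uint64_t gpa, void *buf);  /* e.g. a network fetch */
extern uint64_t fault_gpa_from_run(struct kvm_run *run); /* stand-in for the fault
                                                            info this series adds */

static void vcpu_loop(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                if (ioctl(vcpu_fd, KVM_RUN, 0) == 0) {
                        /* Normal exits (MMIO, shutdown, ...) handled here. */
                        continue;
                }
                if (errno != EFAULT)
                        break;          /* EINTR etc. not shown */

                /* Stage-2 fault on an absent page, delivered via vCPU exit. */
                uint64_t gpa = fault_gpa_from_run(run) & ~((uint64_t)PAGE_SIZE - 1);
                char page[PAGE_SIZE];

                fetch_from_source(gpa, page);

                /*
                 * Install it through the usual uffd ioctl, which also wakes
                 * anything else (e.g. a blocked copy_to_user() path) waiting
                 * on the same page.  EEXIST just means another thread beat
                 * us to it, which is fine.
                 */
                struct uffdio_copy copy = {
                        .dst = (uintptr_t)gpa_to_hva(gpa),
                        .src = (uintptr_t)page,
                        .len = PAGE_SIZE,
                };
                if (ioctl(uffd, UFFDIO_COPY, &copy) && errno != EEXIST)
                        break;

                /* Loop around and re-enter KVM_RUN. */
        }
}

The important part is that the faulting vCPU thread fetches and installs
the page itself (and UFFDIO_COPY wakes anyone else blocked on that page),
rather than queuing a message on the single uffd and waiting for a reader
thread to service it.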
However, all of the other changes (KVM_MEMORY_FAULT_INFO, the bulk of this
series) are totally unrelated to whether/how faults are queued onto
userfaultfd. Page faults from copy_to_user()/copy_from_user(), etc. will
continue to be delivered via uffd (if one is registered, obviously), and
changing that is *not* a goal. All that KVM_MEMORY_FAULT_INFO does is
deliver some extra information to userspace in cases where KVM_RUN
currently just returns -EFAULT.

Hopefully this, and my response in [1], clear things up. If not, let me
know and I'll be glad to discuss further.

[1] https://lore.kernel.org/kvm/ZEGuogfbtxPNUq7t@x1n/T/#m76f940846ecc94ea85efa80ffbe42366c2352636

> Have any of you tried to investigate the other approach to scale
> userfaultfd?

As Axel mentioned, we considered sharding VMAs but didn't pursue it, for a
few different reasons.

> It seems userfaultfd does one thing great which is to have the trapping at
> an unified place (when the page fault happens), hence it doesn't need to
> worry on random codes splat over KVM module read/write a guest page. The
> question is whether it'll be easy to do so.

See a couple of notes above.

> Split vma definitely is still a way to scale userfaultfd, but probably not
> in a good enough way because it's scaling in memory axis, not cores. If
> tens of cores accessing a small region that falls into the same VMA, then
> it stops working.
>
> However maybe it can be scaled in other form? So far my understanding is
> "read" upon uffd for messages is still not a problem - the read can be done
> in chunk, and each message will be converted into a request to be send
> later.
>
> If the real problem relies in a bunch of threads queuing, is it possible
> that we can provide just more queues for the events? The readers will just
> need to go over all the queues.
>
> Way to decide "which thread uses which queue" can be another problem, what
> comes ups quickly to me is a "hash(tid) % n_queues" but maybe it can be
> better. Each vcpu thread will have different tids, then they can hopefully
> scale on the queues.
>
> There's at least one issue that I know with such an idea, that after we
> have >1 uffd queues it means the message order will be uncertain. It may
> matter for some uffd users (e.g. cooperative userfaultfd, see
> UFFD_FEATURE_FORK|REMOVE|etc.) because I believe order of messages matter
> for them (mostly CRIU). But I think that's not a blocker either because we
> can forbid those features with multi queues.
>
> That's a wild idea that I'm just thinking about, which I have totally no
> idea whether it'll work or not. It's more or less of a generic question on
> "whether there's chance to scale on uffd side just in case it might be a
> cleaner approach", when above concern is a real concern.

You bring up a good point, which is that this series only deals with uffd's
performance in the context of KVM. I had another idea in this vein, which
was to allow dedicating queues to certain threads: I even threw together a
prototype, though a bug in it stopped me from ever getting a real signal :(

I think there's still potential to make uffd itself faster, but as you
point out, that might get messy from an API perspective (I know my
prototype did :) and it's going to require more investigation and
prototyping.
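FWIW, the queue-selection part of that idea seems like the easy bit. Just
to illustrate the hash(tid) % n_queues suggestion (a sketch of the hashing
only -- not what my prototype did, and of course no such multi-queue uffd
API exists today):

#include <stdint.h>
#include <sys/types.h>

#define N_QUEUES 16     /* arbitrary */

/*
 * Pick an event queue for a faulting thread.  A multiplicative hash keeps
 * consecutive tids (which vCPU threads tend to have) from clustering onto
 * neighboring queues.
 */
static inline unsigned int queue_for_tid(pid_t tid)
{
        uint64_t h = (uint64_t)tid * 0x9E3779B97F4A7C15ull;

        return (unsigned int)(h >> 32) % N_QUEUES;
}

The messier parts are, as you say, the reader-side API for the extra
queues and what to do about users that depend on message ordering.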
The advantage of the approach in this series is that it's simple, makes a
lot of conceptual sense IMO (in that the previously-stalled vCPU threads
can now participate in the work of demand fetching), and addresses a very
important (probably *the* most important) bottleneck for KVM + uffd-based
postcopy.