Hi, Anish,

On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > I explained why I think it could be useful to test this in my reply to
> > Nadav, do you think it makes sense to you?
>
> Ah, I actually missed your reply to Nadav: didn't realize you had sent
> *two* emails.
>
> > While OTOH if multi-uffd can scale well, then there's a chance of
> > general solution as long as we can remove the single-queue
> > contention over the whole guest mem.
>
> I don't quite understand your statement here: if we pursue multi-uffd,
> then it seems to me that by definition we've removed the single
> queue(s) for all of guest memory, and thus the associated contention.
> And we'd still have the issue of multiple vCPUs contending for a
> single UFFD.

Yes, as I mentioned it was purely something I was curious about, and it
also shows the best result we could get if we went for a more generic
solution; it doesn't really solve the issue immediately.

> But I do share some of your curiosity about multi-uffd performance,
> especially since some of my earlier numbers indicated that multi-uffd
> doesn't scale linearly, even when each vCPU corresponds to a single
> UFFD.
>
> So, I grabbed some more profiles for 32 and 64 vcpus using the following command
> ./demand_paging_test -b 512M -u MINOR -s shmem -v <n> -r 1 -c <1,...,n>
>
> The 32-vcpu config achieves a per-vcpu paging rate of 8.8k. That rate
> goes down to 3.9k (!) with 64 vCPUs. I don't immediately see the issue
> from the traces, but safe to say it's definitely not scaling. Since I
> applied your fixes from earlier, the prefaulting isn't being counted
> against the demand paging rate either.
>
> 32-vcpu profile:
> https://drive.google.com/file/d/19ZZDxZArhSsbW_5u5VcmLT48osHlO9TG/view?usp=drivesdk
> 64-vcpu profile:
> https://drive.google.com/file/d/1dyLOLVHRNdkUoFFr7gxqtoSZGn1_GqmS/view?usp=drivesdk
>
> Do let me know if you need svg files instead and I'll try and figure that out.

Thanks for trying all these out, and sorry if I caused confusion in my
reply.  What I wanted to do is understand whether there's still a chance
to provide a generic solution.

I don't know why you have a bunch of PMU stacks showing up in the graph;
perhaps you forgot to disable some of the perf events when running the
test?  Let me know if you figure out why it happened like that (so far I
don't see it), but I feel guilty about keeping you loaded with such
questions.

The major problem I have with this series is that it's definitely not a
clean approach.  Even if you rely entirely on the userspace app, you'll
still need userfaultfd for kernel traps on corner cases, or it just won't
work.  IIUC that's also the concern from Nadav.

But I also agree it seems to resolve every bottleneck in the kernel, no
matter whether it's in the scheduler or in vCPU loading.  After all, you
throw everything into userspace.. :)

Considering that most of the changes are for -EFAULT traps and the 2nd
part of the change is very self-contained and maintainable, no objection
here to having it.  I'll leave that to the maintainers to decide.

Thanks,

-- 
Peter Xu
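
[Editor's note: for readers following the multi-uffd discussion above, the
sketch below illustrates the setup being described: one userfaultfd per
slice of guest memory, registered in MINOR mode over shmem, so fault events
for different slices queue on different fds instead of contending on a
single queue. This is not code from the series under review; the slice
layout, feature flag, and omitted error handling are assumptions, and only
the standard userfaultfd API is used.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Illustrative only: create a userfaultfd dedicated to one slice of
 * guest memory (e.g. the range backing a single vCPU's working set,
 * a hypothetical layout).  Error handling is omitted for brevity.
 */
static int uffd_for_slice(void *base, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	/* Handshake; MINOR faults on shmem need a recent kernel. */
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_MINOR_SHMEM,
	};
	ioctl(uffd, UFFDIO_API, &api);

	/* Register only this slice, so its faults land on this fd. */
	struct uffdio_register reg = {
		.range = { .start = (unsigned long)base, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	return uffd; /* each slice's handler thread polls its own fd */
}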