On Wed, Apr 19, 2023 at 12:56 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> Hi, Anish,
>
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > KVM's demand paging self test is extended to demonstrate the performance
> > benefits of using the two new capabilities to bypass the userfaultfd
> > wait queue. The performance samples below (rates in thousands of
> > pages/s, n = 5), were generated using [2] on an x86 machine with 256
> > cores.
> >
> > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > 1       150     340
> > 2       191     477
> > 4       210     809
> > 8       155     1239
> > 16      130     1595
> > 32      108     2299
> > 64      86      3482
> > 128     62      4134
> > 256     36      4012
>
> The number looks very promising. Though..
>
> > [1] https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@xxxxxxxxxxxxxx/
> > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> >     A quick rundown of the new flags (also detailed in later commits)
> >         -a registers all of guest memory to a single uffd.
>
> ... this is the worst case scenario. I'd say it's slightly unfair to
> compare by first introducing a bottleneck then compare with it. :)
>
> Jokes aside: I'd think it'll make more sense if such a performance solution
> will be measured on real systems showing real benefits, because so far it's
> still not convincing enough if it's only with the test especially with only
> one uffd.
>
> I don't remember whether I used to discuss this with James before, but..
>
> I know that having multiple uffds in productions also means scattered guest
> memory and scattered VMAs all over the place. However split the guest
> large mem into at least a few (or even tens of) VMAs may still be something
> worth trying? Do you think that'll already solve some of the contentions
> on userfaultfd, either on the queue or else?

We considered sharding into several UFFDs (a rough sketch of what that
setup might look like is further down in this reply). I do think it
helps, but I also think there are two main problems with it:

- One is, I think there's a limit to how much you'd want to do that.
E.g. splitting guest memory in 1/2, or in 1/10, could be reasonable,
but 1/100 or 1/1000 might become ridiculous in terms of the
"scattering" of VMAs and so on like you mentioned. Especially for very
large VMs (e.g., Google offers VMs with ~11T of RAM [1]), I'm not sure
splitting just "slightly" is enough to get good performance.

- Another is, sharding UFFDs sort of assumes that accesses are randomly
distributed across the guest physical address space. I'm not sure this
is guaranteed for all possible VMs / customer workloads. In other
words, even if we shard across several UFFDs, we may end up with a
small number of them being "hot".

A benefit of Anish's series is that it solves the problem more
fundamentally and allows demand paging with no "global" locking. So it
will scale better regardless of VM size or access pattern.

[1]: https://cloud.google.com/compute/docs/memory-optimized-machines

> With a bunch of VMAs and userfaultfds (paired with uffd fault handler
> threads, totally separate uffd queues), I'd expect to some extend other
> things can pop up already, e.g., the network bandwidth, without teaching
> each vcpu thread to report uffd faults themselves.
>
> These are my pure imaginations though, I think that's also why it'll be
> great if such a solution can be tested more or less on a real migration
> scenario to show its real benefits.

I wonder, is there an existing open source QEMU/KVM based live
migration stress test?
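Coming back to the sharding point above: here is roughly the kind of
setup I was referring to. Purely illustrative, not code from the
series; NR_SHARDS, guest_base, and guest_size are made-up names,
guest_base is assumed to already be mapped, and error handling is
minimal.

/*
 * Rough sketch: shard one guest memory region across several
 * userfaultfds, one per equally-sized slice.
 */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NR_SHARDS 4

static int shard_uffds(void *guest_base, size_t guest_size,
		       int uffds[NR_SHARDS])
{
	size_t slice = guest_size / NR_SHARDS;

	for (int i = 0; i < NR_SHARDS; i++) {
		struct uffdio_api api = { .api = UFFD_API };
		struct uffdio_register reg = {
			.range.start = (unsigned long)guest_base + i * slice,
			.range.len = slice,
			.mode = UFFDIO_REGISTER_MODE_MISSING,
		};

		uffds[i] = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
		if (uffds[i] < 0 ||
		    ioctl(uffds[i], UFFDIO_API, &api) ||
		    ioctl(uffds[i], UFFDIO_REGISTER, &reg)) {
			perror("uffd shard setup");
			return -1;
		}
		/*
		 * Each shard gets its own userfaultfd and thus its own
		 * fault queue; a dedicated handler thread would poll/read
		 * uffds[i] and resolve faults with UFFDIO_COPY (or
		 * UFFDIO_CONTINUE in minor mode).
		 */
	}
	return 0;
}

Even with a layout like this, each shard still has a single wait queue
and its own handler threads, so a skewed access pattern can leave one
shard as the bottleneck, which is the second problem above.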
I think we could share numbers from some of our internal benchmarks, or
at the very least give relative numbers (e.g. +50% increase), but since
a lot of the software stack is proprietary (e.g. we don't use QEMU), it
may not be that useful or reproducible for folks.

> Thanks,
>
> --
> Peter Xu
>