Hi Peter,

You're absolutely right that we could demonstrate more contention by
avoiding UFFD and just letting the kernel resolve page faults. I used
UFFD in this test and in the benchmarking for the other MMU patch set
because I believe it is the more realistic scenario. A simpler page
access benchmark would be better for identifying further scaling
problems within the MMU, but the only situation I can think of where
that access pattern occurs is VM boot, and we don't usually see many
vCPUs touching memory all over the place on boot. In a migration or
restore without demand paging, the memory would have to be
pre-populated with the contents of guest memory, and the KVM MMU
fault handler wouldn't be taking a fault in get_user_pages.

In the interest of eliminating the delay from UFFD, I will add an
option to use anonymous page faults or to prefault memory instead.

I don't have any plans to customize the UFFD implementation at the
moment, but experimenting with UFFD strategies will be useful for
building higher-performance post-copy in QEMU and other userspaces
in the future.

Thank you for taking a look at these patches.
Ben

On Sun, Sep 29, 2019 at 12:23 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Fri, Sep 27, 2019 at 09:18:28AM -0700, Ben Gardon wrote:
> > When handling page faults for many vCPUs during demand paging, KVM's MMU
> > lock becomes highly contended. This series creates a test with a naive
> > userfaultfd based demand paging implementation to demonstrate that
> > contention. This test serves both as a functional test of userfaultfd
> > and a microbenchmark of demand paging performance with a variable number
> > of vCPUs and memory per vCPU.
> >
> > The test creates N userfaultfd threads, N vCPUs, and a region of memory
> > with M pages per vCPU. The N userfaultfd polling threads are each set up
> > to serve faults on a region of memory corresponding to one of the vCPUs.
> > Each of the vCPUs is then started, and touches each page of its disjoint
> > memory region, sequentially. In response to faults, the userfaultfd
> > threads copy a static buffer into the guest's memory. This creates a
> > worst case for MMU lock contention as we have removed most of the
> > contention between the userfaultfd threads and there is no time required
> > to fetch the contents of guest memory.
>
> Hi, Ben,
>
> Even though I may not have enough MMU knowledge to say this... this of
> course looks like a good test at least to me. I'm just curious
> whether you have plans to customize the userfaultfd handler in the
> future with this infrastructure?
>
> Asked because IIUC with this series userfaultfd only plays a role to
> introduce a relatively ad hoc delay to page faults. In other words,
> I'm also curious what the numbers would look like (as you mentioned
> in your MMU rework cover letter) if you simply start hundreds of vCPUs
> and do the same test like this, but use the default anonymous page
> faults rather than uffd page faults. I feel like even without uffd
> there could be huge contention already. Or did I miss anything
> important in your decision to use userfaultfd?
>
> Thanks,
>
> --
> Peter Xu
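
P.S. For anyone skimming the thread, the per-vCPU handling the cover
letter describes boils down to a pattern like the sketch below. This
is illustrative only, not the selftest code itself: the function
names, the fixed 4 KiB page size, and the minimal error handling are
my own simplifications, and a real implementation also needs a clean
way to stop the polling threads. Each vCPU's region gets its own
userfaultfd and polling thread, so the handler threads mostly stay
out of each other's way and the remaining contention is on KVM's MMU
lock.

/*
 * Simplified sketch: register a region for missing-page faults and
 * resolve each fault by copying a static buffer with UFFDIO_COPY.
 */
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL

/* Register [addr, addr + len) for missing-page tracking; return the uffd. */
static int demand_paging_setup(void *addr, uint64_t len)
{
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {
		.range = {
			.start = (uint64_t)(uintptr_t)addr,
			.len = len,
		},
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
	    ioctl(uffd, UFFDIO_REGISTER, &reg)) {
		perror("userfaultfd setup");
		exit(1);
	}
	return uffd;
}

/* Poll for faults and resolve each one from a pre-filled static buffer. */
static void handle_uffd_page_requests(int uffd, const void *src_buf)
{
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;

	while (poll(&pfd, 1, -1) > 0) {
		struct uffdio_copy copy;

		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		copy = (struct uffdio_copy) {
			.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1),
			.src = (uint64_t)(uintptr_t)src_buf,
			.len = PAGE_SIZE,
		};
		/* EEXIST just means the page was already resolved. */
		if (ioctl(uffd, UFFDIO_COPY, &copy) && errno != EEXIST)
			perror("UFFDIO_COPY");
	}
}

Because the source buffer is static and already resident, the copy
itself is cheap, which is what makes the MMU lock the dominant cost
in the measurement rather than fetching guest memory contents.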