Re: [PATCH 0/9] Create a userfaultfd demand paging test

Hi Peter,
You're absolutely right that we could demonstrate more contention by
avoiding UFFD and just letting the kernel resolve the page faults. I
used UFFD in this test, and in the benchmarking for the other MMU
patch set, because I believe it's the more realistic scenario. A
simpler page-access benchmark would be better for identifying further
scaling problems within the MMU, but the only situation I can think
of where that access pattern occurs is VM boot, and we don't usually
see many vCPUs touching memory all over the place on boot. In a
migration or restore without demand paging, the memory would have to
be pre-populated with the contents of guest memory, so the KVM MMU
fault handler wouldn't be taking a fault in get_user_pages. In the
interest of eliminating the delay from UFFD, I will add an option to
use anonymous page faults or to prefault memory instead.
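
The prefault path can be as simple as touching every page of each
vCPU's region before the vCPUs start, so the KVM MMU fault handler
never has to wait for get_user_pages to populate new anonymous
memory. Roughly (a sketch only, not code from this series; the helper
name is made up):

static void prefault_region(char *base, size_t size, size_t page_size)
{
	size_t off;

	/* Write to each page so it is backed before the vCPUs run. */
	for (off = 0; off < size; off += page_size)
		base[off] = 0;
}

The anonymous page fault variant would just skip the UFFDIO_REGISTER
step and this loop, leaving the kernel to populate pages from the
vCPU fault path.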

I don't have any plans to customize the UFFD implementation at the
moment, but experimenting with UFFD strategies will be useful for
building higher-performance post-copy in QEMU and other userspaces in
the future.
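
For context, the handler in this series is just the naive strategy
from the cover letter quoted below: each polling thread waits on its
userfaultfd and resolves every fault with a UFFDIO_COPY from a static
buffer. The shape is roughly the following (an illustrative sketch,
not the exact selftest code; the fd is assumed to have already been
created with the userfaultfd syscall and registered on the vCPU's
region with UFFDIO_REGISTER):

#include <linux/userfaultfd.h>
#include <poll.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define TEST_PAGE_SIZE 4096UL

static void *uffd_poll_loop(void *arg)
{
	int uffd = (int)(unsigned long)arg;
	static char copy_buf[TEST_PAGE_SIZE];	/* static source buffer */
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };

	for (;;) {
		struct uffd_msg msg;
		struct uffdio_copy copy;

		if (poll(&pfd, 1, -1) <= 0)
			continue;
		if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
			continue;
		if (msg.event != UFFD_EVENT_PAGEFAULT)
			continue;

		/* Resolve the fault by copying the static buffer into
		 * the faulting page of guest memory. */
		copy.dst = msg.arg.pagefault.address &
			   ~((__u64)TEST_PAGE_SIZE - 1);
		copy.src = (unsigned long)copy_buf;
		copy.len = TEST_PAGE_SIZE;
		copy.mode = 0;
		ioctl(uffd, UFFDIO_COPY, &copy);
	}
	return NULL;
}

Experimenting with strategies would mostly mean changing what happens
between reading the message and issuing the copy (batching,
prefetching adjacent pages, fetching real guest memory over the
network, etc.).
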
Thank you for taking a look at these patches.
Ben

On Sun, Sep 29, 2019 at 12:23 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Fri, Sep 27, 2019 at 09:18:28AM -0700, Ben Gardon wrote:
> > When handling page faults for many vCPUs during demand paging, KVM's MMU
> > lock becomes highly contended. This series creates a test with a naive
> > userfaultfd based demand paging implementation to demonstrate that
> > contention. This test serves both as a functional test of userfaultfd
> > and a microbenchmark of demand paging performance with a variable number
> > of vCPUs and memory per vCPU.
> >
> > The test creates N userfaultfd threads, N vCPUs, and a region of memory
> > with M pages per vCPU. The N userfaultfd polling threads are each set up
> > to serve faults on a region of memory corresponding to one of the vCPUs.
> > Each of the vCPUs is then started, and touches each page of its disjoint
> > memory region, sequentially. In response to faults, the userfaultfd
> > threads copy a static buffer into the guest's memory. This creates a
> > worst case for MMU lock contention as we have removed most of the
> > contention between the userfaultfd threads and there is no time required
> > to fetch the contents of guest memory.
>
> Hi, Ben,
>
> Even though I may not have enough MMU knowledge to say this... this of
> course looks like a good test at least to me.  I'm just curious about
> whether you have plans to customize the userfaultfd handler in the
> future with this infrastructure?
>
> Asked because IIUC with this series userfaultfd only plays a role in
> introducing a relatively ad hoc delay to page faults.  In other words,
> I'm also curious what the numbers would look like (as you mentioned
> in your MMU rework cover letter) if you simply started hundreds of
> vCPUs and ran the same test, but used the default anonymous page
> faults rather than uffd page faults.  I feel like even without uffd
> there could already be huge contention there.  Or did I miss anything
> important in your decision to use userfaultfd?
>
> Thanks,
>
> --
> Peter Xu


