Re: [RFC] Improving userfaultfd scalability for live migration

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Dec 5, 2022 at 8:06 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Mon, Dec 05, 2022, James Houghton wrote:
> > On Mon, Dec 5, 2022 at 1:20 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Mon, Dec 05, 2022, David Matlack wrote:
> > > > On Mon, Dec 5, 2022 at 7:30 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > > > ...
> > > > > I'll have a closer read on the nested part, but note that this path already
> > > > > has the mmap lock then it invalidates the goal if we want to avoid taking
> > > > > it from the first place, or maybe we don't care?
> >
> > Not taking the mmap lock would be helpful, but we still have to take
> > it in UFFDIO_CONTINUE, so it's ok if we have to still take it here.
>
> IIUC, Peter is suggesting that the kernel not even get to the point where UFFD
> is involved.  The "fault" would get propagated to userspace by KVM, userspace
> fixes the fault (gets the page from the source, does MADV_POPULATE_WRITE), and
> resumes the vCPU.

If we haven't UFFDIO_CONTINUE'd some address range yet,
MADV_POPULATE_WRITE for that range will drop into handle_userfault and
go to sleep. Not good!

So, going with the no-slow-GUP approach, resolving faults is done like this:
- If we haven't UFFDIO_CONTINUE'd yet, do that now and restart
KVM_RUN. The PTEs will be none/blank right now. This is the common
case.
- If we have UFFDIO_CONTINUE'd already, if we were to do it again, we
would get EEXIST. (In this case, we probably have some type of swap
entry in the page tables.) We have to change the page tables to make
fast GUP succeed now *without* using UFFDIO_CONTINUE now.
MADV_POPULATE_WRITE seems to be the right tool for the job. This case
happens if the kernel has swapped the memory out, is migrating it, has
poisoned it, etc. If MADV_POPULATE_WRITE fails, we probably need to
crash or inject a memory error.

So with this approach, we never need to take the mmap_lock for reading
in hva_to_pfn, but we still need to take it in UFFDIO_CONTINUE.
Without removing the mmap_lock from *both*, we don't gain much.

So if we disregard this tiny mmap_lock benefit, the other approach
(the PF_NO_UFFD_WAIT approach) seems better. When KVM_RUN exits:
- If we haven't UFFDIO_CONTINUE'd yet, do that now and restart KVM_RUN.
- If we have, then something bad has happened. Slow GUP already ran
and failed, so we need to treat this in the same way we treat a
MADV_POPULATE_WRITE failure above: userspace might just want to crash
(or inject a memory error or something).

- James




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux