Re: [PATCH v3 00/22] Improve scalability of KVM + userfaultfd live migration via annotated memory faults.

Anish Moorthy <amoorthy@xxxxxxxxxx> · Wed, 10 May 2023 09:35:52 -0700

On Tue, May 9, 2023 at 3:19 PM David Matlack <dmatlack@xxxxxxxxxx> wrote:
>
> On Wed, Apr 12, 2023 at 09:34:48PM +0000, Anish Moorthy wrote:
> > Upon receiving an annotated EFAULT, userspace may take appropriate
> > action to resolve the failed access. For instance, this might involve a
> > UFFDIO_CONTINUE or MADV_POPULATE_WRITE in the context of uffd-based live
> > migration postcopy.
>
> As implemented, I think it will be prohibitively expensive if not
> impossible for userspace to determine why KVM is returning EFAULT when
> KVM_CAP_ABSENT_MAPPING_FAULT is enabled, which means userspace can't
> decide the correct action to take (try to resolve or bail).
>
> Consider the direct_map() case in patch in PATCH 15. The only way to hit
> that condition is a logic bug in KVM or data corruption. There isn't
> really anything userspace can do to handle this situation, and it has no
> way to distinguish that from faults to due absent mappings.
>
> We could end up hitting cases where userspace loops forever doing
> KVM_RUN, EFAULT, UFFDIO_CONTINUE/MADV_POPULATE_WRITE, KVM_RUN, EFAULT...
>
> Maybe we should just change direct_map() to use KVM_BUG() and return
> something other than EFAULT. But the general problem still exists and
> even if we have confidence in all the current EFAULT sites, we don't have
> much protection against someone adding an EFAULT in the future that
> userspace can't handle.

Hmm, I had been operating under the assumption that userspace would
always have been able to make the memory access succeed somehow- I
(naively) didn't count on some guest memory access errors being
unrecoverable.

If that's the case, then we're back to needing some way to distinguish
the new faults/exits emitted by user_mem_abort/kvm_faultin_pfn with
the ABSENT_MAPPING_FAULT cap enabled :/ Let me paste in a bit of what
Sean said to refute the idea of a special page-fault-failure set in
those spots.

(from https://lore.kernel.org/kvm/ZBoIzo8FGxSyUJ2I@xxxxxxxxxx/)
On Tue, Mar 21, 2023 at 12:43 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> Setting a flag that essentially says "failure when handling a guest page fault"
> is problematic on multiple fronts.  Tying the ABI to KVM's internal implementation
> is not an option, i.e. the ABI would need to be defined as "on page faults from
> the guest".  And then the resulting behavior would be non-deterministic, e.g.
> userspace would see different behavior if KVM accessed a "bad" gfn via emulation
> instead of in response to a guest page fault.  And because of hardware TLBs, it
> would even be possible for the behavior to be non-deterministic on the same
> platform running the same guest code (though this would be exteremly unliklely
> in practice).
>
> And even if userspace is ok with only handling guest page faults_today_, I highly
> doubt that will hold forever.  I.e. at some point there will be a use case that
> wants to react to uaccess failures on fast-only memslots.
>
> Ignoring all of those issues, simplify flagging "this -EFAULT occurred when
> handling a guest page fault" isn't precise enough for userspace to blindly resolve
> the failure.  Even if KVM went through the trouble of setting information if and
> only if get_user_page_fast_only() failed while handling a guest page fault,
> userspace would still need/want a way to verify that the failure was expected and
> can be resolved, e.g. to guard against userspace bugs due to wrongly unmapping
> or mprotecting a page.

I wonder, how much of this problem comes down to my description/name
(I suggested MEMFAULT_REASON_PAGE_FAULT_FAILURE) for the flag? I see
Sean's concerns of the behavior issues when fast-only pages are
accessed via guest mode or via emulation/uaccess. What if the
description of the fast-only fault cap was tightened to something like
"generates vcpu faults/exits in response to *EPT/SLAT violations*
which cannot be mapped by present userspace page table entries?" I
think that would eliminate the emulation/uaccess issues (though I may
be wrong, so please let me know).

Of course, by the time we get to kvm_faultin_pfn we don't know that
we're faulting pages in response to an EPT violation... but if the
idea makes sense then that might justify some plumbing code.