On 19/02/2025 15:17, Sean Christopherson wrote:
On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
On 11/02/2025 21:17, Sean Christopherson wrote:
On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
And it's not just the code itself, it's all the structures and concepts. Off the
top of my head, I can't think of any reason there needs to be a separate queue,
separate lock(s), etc. The only difference between kernel APF and user APF is
what chunk of code is responsible for faulting in the page.
There are two queues involved:
- "queue": stores in-flight faults. APF-kernel uses it to cancel all works
if needed. APF-user does not have a way to "cancel" userspace works, but it
uses the queue to look up the struct by the token when userspace reports a
completion.
- "ready": stores completed faults until KVM finds a chance to tell guest
about them.
I agree that the "ready" queue can be shared between APF-kernel and -user,
as it's used in the same way. As for the "queue" queue, do you think it's
ok to process its elements differently based on their type in a single
loop [1], instead of having two separate queues?
Yes.
[1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
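To illustrate what I mean by a single loop, here is a rough sketch of the
cancel/flush walk branching on a per-item type; the is_user flag and
kvm_async_pf_user_forget_token() are made-up names for illustration, not
existing kernel symbols:

        while (!list_empty(&vcpu->async_pf.queue)) {
                struct kvm_async_pf *work =
                        list_first_entry(&vcpu->async_pf.queue,
                                         typeof(*work), queue);

                list_del(&work->queue);

                if (work->is_user) {
                        /* Userspace owns the fault; just forget the token
                         * so that a late completion report is ignored. */
                        kvm_async_pf_user_forget_token(vcpu, work);
                } else if (cancel_work_sync(&work->work)) {
                        kmem_cache_free(async_pf_cache, work);
                }
        }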
I suspect a good place to start would be something along the lines of the below
diff, and go from there. Given that KVM already needs to special case the fake
"wake all" items, I'm guessing it won't be terribly difficult to teach the core
flows about userspace async #PF.
That sounds sensible. I can certainly approach it "bottom up", by
sparingly adding APF-user-specific handling where it differs, rather than
implementing it side by side and then merging the common parts.
I'm also not sure that injecting async #PF for all userfaults is desirable. For
in-kernel async #PF, KVM knows that faulting in the memory would sleep. For
userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
be handled via async #PF. The obvious answer is to have userspace only enable
userspace async #PF when it's useful, but an "all or nothing" approach isn't
great uAPI. On the flip side, adding uAPI for a use case that doesn't exist
doesn't make sense either :-/
I wasn't able to locate the code that checks whether faulting would sleep
in APF-kernel. KVM seems to spin up APF-kernel whenever it can ([2]).
Please let me know if I'm missing something here.
kvm_can_do_async_pf() will be reached if and only if faulting in the memory
requires waiting. If a page is swapped out, but faulting it back in doesn't
require waiting, e.g. because it's in zswap and can be uncompressed synchronously,
then the initial __kvm_faultin_pfn() with FOLL_NOWAIT will succeed.
        /*
         * If resolving the page failed because I/O is needed to fault-in the
         * page, then either set up an asynchronous #PF to do the I/O, or if
         * doing an async #PF isn't possible, retry with I/O allowed. All
         * other failures are terminal, i.e. retrying won't help.
         */
        if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
                return RET_PF_CONTINUE;

        if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
                trace_kvm_try_async_get_page(fault->addr, fault->gfn);
                if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
                        trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
                        kvm_make_request(KVM_REQ_APF_HALT, vcpu);
                        return RET_PF_RETRY;
                } else if (kvm_arch_setup_async_pf(vcpu, fault)) {
                        return RET_PF_RETRY;
                }
        }
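For illustration, a userspace variant could slot into that same flow along
these lines; KVM_PFN_ERR_USERFAULT, kvm_arch_setup_user_async_pf() and
kvm_handle_userfault_exit() are hypothetical names, not existing symbols:

        if (fault->pfn == KVM_PFN_ERR_USERFAULT) {
                /* Let the vCPU keep running while userspace faults the
                 * page in... */
                if (!fault->prefetch && kvm_can_do_async_pf(vcpu) &&
                    kvm_arch_setup_user_async_pf(vcpu, fault))
                        return RET_PF_RETRY;

                /* ...or fall back to a synchronous exit to userspace. */
                return kvm_handle_userfault_exit(vcpu, fault);
        }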
The conundrum with userspace async #PF is that if userspace is given only a single
bit per gfn to force an exit, then KVM won't be able to differentiate between
"faults" that will be handled synchronously by the vCPU task, and faults that
userspace will hand off to an I/O task. If the fault is handled
synchronously, KVM will needlessly inject a not-present #PF and a 'page
ready' IRQ.
Right, but from the guest's point of view, async PF means "it will
probably take a while for the host to get the page, so I may consider
doing something else in the meantime (i.e. schedule another process if
available)". If we are exiting to userspace, it isn't going to be quick
anyway, so we can consider all such faults "long" and warranting the
execution of the async PF protocol. So always injecting a not-present
#PF and page ready IRQ doesn't look too wrong in that case.
But that's a non-issue if the known use cases are all-or-nothing, i.e. if all
userspace faults are either synchronous or asynchronous.
Yes, pretty much. The user will be choosing the extreme that is more
performant for their specific use case.
[2] https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
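If an all-or-nothing knob turns out to be enough, the opt-in could be a
plain capability; KVM_CAP_ASYNC_PF_USER is a placeholder name here, while
KVM_ENABLE_CAP and struct kvm_enable_cap are existing uAPI:

        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_ASYNC_PF_USER,   /* placeholder name */
        };

        if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
                err(1, "KVM_ENABLE_CAP");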
Exiting to userspace in vCPU context is also kludgy. It makes sense for base
userfault, because the vCPU can't make forward progress until the fault is
resolved. Actually, I'm not even sure it makes sense there. I'll follow up in
Even though we exit to userspace, in the case of APF-user, userspace is
supposed to re-enter the VM straight after scheduling the async job, which
then executes concurrently with the vCPU.
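Schematically, the vCPU loop I have in mind looks like the following; the
KVM_EXIT_ASYNC_PF_USER name and the run->async_pf_user fields are
placeholders for whatever the uAPI ends up defining:

/* Sketch of a vCPU run loop that hands the fault to an I/O thread and
 * re-enters the guest immediately. */
#include <pthread.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

struct apf_job {
        __u64 gpa;
        __u32 token;
};

static void *resolve_fault(void *arg)
{
        struct apf_job *job = arg;

        /* Fault the page in (e.g. UFFDIO_COPY, or a pread() into the
         * backing mapping), then report job->token back to KVM so it can
         * inject 'page ready'.  The reporting ioctl is a placeholder too. */
        free(job);
        return NULL;
}

static void run_vcpu(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                ioctl(vcpu_fd, KVM_RUN, 0);

                if (run->exit_reason == KVM_EXIT_ASYNC_PF_USER) {
                        struct apf_job *job = malloc(sizeof(*job));
                        pthread_t tid;

                        job->gpa = run->async_pf_user.gpa;      /* placeholder */
                        job->token = run->async_pf_user.token;  /* placeholder */
                        pthread_create(&tid, NULL, resolve_fault, job);
                        pthread_detach(tid);
                        continue;       /* re-enter straight away */
                }

                /* ... handle other exit reasons ... */
        }
}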
James' series. Anyways, it definitely doesn't make sense for async #PF, because
the whole point is to let the vCPU run. Signalling userspace would definitely
add complexity, but only because of the need to communicate the token and wait
for userspace to consume said token. I'll think more on that.
By signalling userspace, do you mean a new non-exit-to-userspace mechanism,
similar to UFFD?
Yes.
What advantage can you see in it over exiting to userspace (which already exists
in James's series)?
It doesn't exit to userspace :-)
If userspace simply wakes a different task in response to the exit, then KVM
should be able to wake said task, e.g. by signalling an eventfd, and resume the
guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
or not such an optimization is worth the complexity is an entirely different
question though.
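If it helps, the shape I'd imagine is roughly the following; only
eventfd_signal() is existing kernel API, everything else is a made-up name:

        static bool kvm_setup_user_async_pf(struct kvm_vcpu *vcpu,
                                            struct kvm_page_fault *fault)
        {
                u32 token = kvm_async_pf_next_token(vcpu);      /* hypothetical */

                /* Publish gfn+token where the I/O task can read them... */
                if (!kvm_apf_user_publish(vcpu, fault->gfn, token))     /* hypothetical */
                        return false;

                /* ...kick the task, and let the vCPU keep running. */
                eventfd_signal(vcpu->arch.apf_user_eventfd);    /* hypothetical field */
                return true;
        }

The hard part, as noted above, is waiting for userspace to consume the
token before the slot can be reused.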
This reminds me of the discussion about VMA-less UFFD that has come up
several times, such as in [1], but AFAIK it hasn't materialised into
anything actionable. I may be wrong, but James was looking into that and
couldn't figure out a way to scale it sufficiently for his use case, so he
had to stick with the VM-exit-based approach. Can you see a world where
VM-exit userfaults coexist with a no-VM-exit way of handling async PFs?
[1]: https://lore.kernel.org/kvm/ZqwKuzfAs7pvdHAN@x1n/