Re: [RFC PATCH 0/6] KVM: x86: async PF user

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 19 Feb 2025 07:17:36 -0800

On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
> On 11/02/2025 21:17, Sean Christopherson wrote:
> > On Mon, Nov 18, 2024, Nikita Kalyazin wrote:
> > And it's not just the code itself, it's all the structures and concepts.  Off the
> > top of my head, I can't think of any reason there needs to be a separate queue,
> > separate lock(s), etc.  The only difference between kernel APF and user APF is
> > what chunk of code is responsible for faulting in the page.
> 
> There are two queues involved:
>  - "queue": stores in-flight faults. APF-kernel uses it to cancel all works
> if needed.  APF-user does not have a way to "cancel" userspace works, but it
> uses the queue to look up the struct by the token when userspace reports a
> completion.
>  - "ready": stores completed faults until KVM finds a chance to tell guest
> about them.
> 
> I agree that the "ready" queue can be shared between APF-kernel and -user as
> it's used in the same way.  As for the "queue" queue, do you think it's ok
> to process its elements differently based on the "type" of them in a single
> loop [1] instead of having two separate queues?

Yes.

> [1] https://elixir.bootlin.com/linux/v6.13.2/source/virt/kvm/async_pf.c#L120
> 
> > I suspect a good place to start would be something along the lines of the below
> > diff, and go from there.  Given that KVM already needs to special case the fake
> > "wake all" items, I'm guessing it won't be terribly difficult to teach the core
> > flows about userspace async #PF.
> 
> That sounds sensible.  I can certainly approach it in a "bottom up" way by
> sparingly adding handling where it's different in APF-user rather than
> adding it side by side and trying to merge common parts.
> 
> > I'm also not sure that injecting async #PF for all userfaults is desirable.  For
> > in-kernel async #PF, KVM knows that faulting in the memory would sleep.  For
> > userfaults, KVM has no way of knowing if the userfault will sleep, i.e. should
> > be handled via async #PF.  The obvious answer is to have userspace only enable
> > userspace async #PF when it's useful, but "an all or nothing" approach isn't
> > great uAPI.  On the flip side, adding uAPI for a use case that doesn't exist
> > doesn't make sense either :-/
> 
> I wasn't able to locate the code that would check whether faulting would
> sleep in APF-kernel.  KVM spins APF-kernel whenever it can ([2]). Please let
> me know if I'm missing something here.

kvm_can_do_async_pf() will be reached if and only if faulting in the memory
requires waiting.  If a page is swapped out, but faulting it back in doesn't
require waiting, e.g. because it's in zswap and can be uncompressed synchronously,
then the initial __kvm_faultin_pfn() with FOLL_NO_WAIT will succeed.

	/*
	 * If resolving the page failed because I/O is needed to fault-in the
	 * page, then either set up an asynchronous #PF to do the I/O, or if
	 * doing an async #PF isn't possible, retry with I/O allowed.  All
	 * other failures are terminal, i.e. retrying won't help.
	 */
	if (fault->pfn != KVM_PFN_ERR_NEEDS_IO)
		return RET_PF_CONTINUE;

	if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) {
		trace_kvm_try_async_get_page(fault->addr, fault->gfn);
		if (kvm_find_async_pf_gfn(vcpu, fault->gfn)) {
			trace_kvm_async_pf_repeated_fault(fault->addr, fault->gfn);
			kvm_make_request(KVM_REQ_APF_HALT, vcpu);
			return RET_PF_RETRY;
		} else if (kvm_arch_setup_async_pf(vcpu, fault)) {
			return RET_PF_RETRY;
		}
	}

The conundrum with userspace async #PF is that if userspace is given only a single
bit per gfn to force an exit, then KVM won't be able to differentiate between
"faults" that will be handled synchronously by the vCPU task, and faults that
usersepace will hand off to an I/O task.  If the fault is handled synchronously,
KVM will needlessly inject a not-present #PF and a present IRQ.

But that's a non-issue if the known use cases are all-or-nothing, i.e. if all
userspace faults are either synchronous or asynchronous.

> [2] https://elixir.bootlin.com/linux/v6.13.2/source/arch/x86/kvm/mmu/mmu.c#L4360
> 
> > Exiting to userspace in vCPU context is also kludgy.  It makes sense for base
> > userfault, because the vCPU can't make forward progress until the fault is
> > resolved.  Actually, I'm not even sure it makes sense there.  I'll follow-up in
> 
> Even though we exit to userspace, in case of APF-user, userspace is supposed
> to VM enter straight after scheduling the async job, which is then executed
> concurrently with the vCPU.
> 
> > James' series.  Anyways, it definitely doesn't make sense for async #PF, because
> > the whole point is to let the vCPU run.  Signalling userspace would definitely
> > add complexity, but only because of the need to communicate the token and wait
> > for userspace to consume said token.  I'll think more on that.
> 
> By signalling userspace you mean a new non-exit-to-userspace mechanism
> similar to UFFD?

Yes.

> What advantage can you see in it over exiting to userspace (which already exists
> in James's series)?

It doesn't exit to userspace :-)

If userspace simply wakes a different task in response to the exit, then KVM
should be able to wake said task, e.g. by signalling an eventfd, and resume the
guest much faster than if the vCPU task needs to roundtrip to userspace.  Whether
or not such an optimization is worth the complexity is an entirely different
question though.