On 20/02/2025 18:49, Sean Christopherson wrote:
On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
On 19/02/2025 15:17, Sean Christopherson wrote:
On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
The conundrum with userspace async #PF is that if userspace is given only a single
bit per gfn to force an exit, then KVM won't be able to differentiate between
"faults" that will be handled synchronously by the vCPU task, and faults that
userspace will hand off to an I/O task. If the fault is handled synchronously,
KVM will needlessly inject a not-present #PF and a present IRQ.
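Illustrated with toy code (every name below is made up; none of this exists
in KVM as such), the ambiguity is roughly:

  #include <stdbool.h>
  #include <stdint.h>

  extern uint64_t *userfault_bitmap;              /* one bit per gfn */

  void inject_async_pf_not_present(uint64_t gfn); /* made-up helpers */
  void exit_to_userspace(uint64_t gfn);

  static bool gfn_is_userfault(uint64_t gfn)
  {
          return userfault_bitmap[gfn / 64] & (1ULL << (gfn % 64));
  }

  static void handle_guest_fault(uint64_t gfn)
  {
          if (!gfn_is_userfault(gfn))
                  return;                         /* normal fault path */

          /*
           * Only one bit of information: "handled by the vCPU task" vs
           * "handed to an I/O task" is unknown here, so the not-present
           * #PF (and later the page ready IRQ) gets injected even when
           * it is pure overhead.
           */
          inject_async_pf_not_present(gfn);
          exit_to_userspace(gfn);
  }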
Right, but from the guest's point of view, async PF means "it will probably
take a while for the host to get the page, so I may consider doing something
else in the meantime (ie schedule another process if available)".
Except in this case, the guest never gets a chance to run, i.e. it can't do
something else. From the guest's point of view, if KVM doesn't inject what is
effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
long time to execute.
Sorry, I didn't get that. If userspace learns from the
kvm_run::memory_fault::flags that the exit is due to an async PF, it
should call KVM_RUN immediately, inject the not-present #PF and allow the
guest to reschedule. What do you mean by "the guest never gets a chance
to run"?
If we are exiting to userspace, it isn't going to be quick anyway, so we can
consider all such faults "long" and warranting the execution of the async PF
protocol. So always injecting a not-present #PF and page ready IRQ doesn't
look too wrong in that case.
There is no "wrong", it's simply wasteful. The fact that the userspace exit is
"long" is completely irrelevant. Decompressing zswap is also slow, but it is
done on the current CPU, i.e. is not background I/O, and so doesn't trigger async
#PFs.
In the guest, if host userspace resolves the fault before redoing KVM_RUN, the
vCPU will get two events back-to-back: an async #PF, and an IRQ signalling completion
of that #PF.
Is this practically likely? At least in our scenario (Firecracker
snapshot restore) and probably in live migration postcopy, if a vCPU
hits a fault, it's probably because the content of the page is somewhere
remote (e.g. on the source machine or wherever the snapshot data is
stored) and isn't going to be available quickly. Conversely, if the
page content is available, it must have already been prepopulated into
the guest memory pagecache, so the bit in the bitmap is cleared and no
exit to userspace occurs.
What advantage can you see in it over exiting to userspace (which already exists
in James's series)?
It doesn't exit to userspace :-)
If userspace simply wakes a different task in response to the exit, then KVM
should be able to wake said task, e.g. by signalling an eventfd, and resume the
guest much faster than if the vCPU task needs to roundtrip to userspace. Whether
or not such an optimization is worth the complexity is an entirely different
question though.
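For the record, the userspace side of that could be as simple as the sketch
below; the ioctl name/number and service_pending_faults() are made up, only
eventfd() itself is real:

  #include <linux/kvm.h>
  #include <stdint.h>
  #include <sys/eventfd.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  /* Hypothetical ioctl, name and number are placeholders. */
  #define KVM_SET_USERFAULT_EVENTFD       _IOW(KVMIO, 0xd0, int)

  void service_pending_faults(void);      /* made up: drains per-vCPU buffers */

  static int efd;

  static void setup_userfault_wakeup(int vcpu_fd)
  {
          efd = eventfd(0, EFD_CLOEXEC);
          /* Ask KVM to signal efd on userfault instead of exiting to userspace. */
          ioctl(vcpu_fd, KVM_SET_USERFAULT_EVENTFD, &efd);
  }

  static void *io_task(void *arg)
  {
          uint64_t cnt;

          for (;;) {
                  /* Blocks until KVM signals the eventfd; the vCPU keeps running. */
                  if (read(efd, &cnt, sizeof(cnt)) != sizeof(cnt))
                          break;
                  service_pending_faults();
          }
          return NULL;
  }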
This reminds me of the discussion about VMA-less UFFD that has come up
several times, such as [1], but AFAIK hasn't materialised into something
actionable. I may be wrong, but James was looking into that and couldn't
figure out a way to scale it sufficiently for his use case and had to stick
with the VM-exit-based approach. Can you see a world where VM-exit
userfaults coexist with a no-VM-exit way of handling async PFs?
The issue with UFFD is that it's difficult to provide a generic "point of contact",
whereas with KVM userfault, signalling can be tied to the vCPU, and KVM can provide
per-vCPU buffers/structures to aid communication.
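E.g. something along these lines (purely illustrative, nothing of the sort
exists today):

  #include <linux/types.h>

  /* Hypothetical per-vCPU communication area, mapped into userspace. */
  struct kvm_userfault_info {
          __u64 gpa;      /* faulting guest-physical address */
          __u64 size;     /* fault granularity */
          __u64 flags;    /* e.g. sync (vCPU task) vs. async (I/O task) */
          __u64 token;    /* async #PF token to complete once the page is in */
  };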
That said, supporting "exitless" KVM userfault would most definitely be premature
optimization without strong evidence it would benefit a real world use case.
Does that mean that the "exitless" solution for async PF is a long-term
one (if required), while the short-term one would still be "exitful" (if we
find a way to do it sensibly)?