Re: [RFC PATCH 0/6] KVM: x86: async PF user

On 27/02/2025 16:44, Sean Christopherson wrote:
On Wed, Feb 26, 2025, Nikita Kalyazin wrote:
On 26/02/2025 00:58, Sean Christopherson wrote:
On Fri, Feb 21, 2025, Nikita Kalyazin wrote:
On 20/02/2025 18:49, Sean Christopherson wrote:
On Thu, Feb 20, 2025, Nikita Kalyazin wrote:
On 19/02/2025 15:17, Sean Christopherson wrote:
On Wed, Feb 12, 2025, Nikita Kalyazin wrote:
The conundrum with userspace async #PF is that if userspace is given only a single
bit per gfn to force an exit, then KVM won't be able to differentiate between
"faults" that will be handled synchronously by the vCPU task, and faults that
userspace will hand off to an I/O task.  If the fault is handled synchronously,
KVM will needlessly inject a not-present #PF and a present IRQ.
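
For illustration, a rough sketch of the single-bit check in question; the
bitmap field and helper names are hypothetical, not the series' actual code:

    static bool gfn_is_userfault(struct kvm_memory_slot *slot, gfn_t gfn)
    {
            /*
             * One bit per gfn tells KVM *that* userspace wants the fault,
             * but not *why*: a fault the vCPU task will resolve
             * synchronously looks identical to one that will be handed
             * off to an I/O task, hence the possibly-needless #PF + IRQ.
             */
            return test_bit(gfn - slot->base_gfn, slot->userfault_bitmap);
    }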

Right, but from the guest's point of view, async PF means "it will probably
take a while for the host to get the page, so I may consider doing something
else in the meantime (i.e. schedule another process if available)".

Except in this case, the guest never gets a chance to run, i.e. it can't do
something else.  From the guest point of view, if KVM doesn't inject what is
effectively a spurious async #PF, the VM-Exiting instruction simply took a (really)
long time to execute.

Sorry, I didn't get that.  If userspace learns from the
kvm_run::memory_fault::flags that the exit is due to an async PF, it should
call KVM_RUN immediately, inject the not-present #PF, and allow the guest to
reschedule.  What do you mean by "the guest never gets a chance to run"?
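
For concreteness, a minimal sketch of that userspace flow, assuming the
KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER flag from this series; the fetch helpers
are hypothetical:

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    extern void queue_remote_fetch(__u64 gpa);  /* hypothetical */
    extern void fetch_page_sync(__u64 gpa);     /* hypothetical */

    static void handle_memory_fault(int vcpu_fd, struct kvm_run *run)
    {
            if (run->exit_reason != KVM_EXIT_MEMORY_FAULT)
                    return;

            if (run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER)
                    /*
                     * KVM already queued page-not-present: hand the fetch
                     * to an I/O thread and re-enter immediately so the
                     * guest can reschedule.
                     */
                    queue_remote_fetch(run->memory_fault.gpa);
            else
                    /* Resolve the fault before re-entering. */
                    fetch_page_sync(run->memory_fault.gpa);

            ioctl(vcpu_fd, KVM_RUN, 0);
    }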

What I'm saying is that, as proposed, the API doesn't precisely tell userspace
                                                                          ^^^^^^^^^
                                                                          KVM
an exit happened due to an "async #PF".  KVM has absolutely zero clue as to
whether or not userspace is going to do an async #PF, or if userspace wants to
intercept the fault for some entirely different purpose.

Userspace is supposed to know whether the PF is async from the dedicated
flag added in the memory_fault structure:
KVM_MEMORY_EXIT_FLAG_ASYNC_PF_USER.  It will be set when KVM has managed to
inject page-not-present.  Are you saying it isn't sufficient?

Gah, sorry, typo.  The API doesn't tell *KVM* that userfault exit is due to an
async #PF.

Unless the remote page was already requested, e.g. by a different vCPU, or by a
prefetching algorithm.

Conversely, if the page content is available, it must have already been
prepopulated into the guest memory pagecache, so the bit in the bitmap is
cleared and no exit to userspace occurs.

But that doesn't happen instantaneously.  Even if the VMM somehow atomically
receives the page and marks it present, it's still possible for marking the page
present to race with KVM checking the bitmap.

That looks like a generic problem of VM-exit fault handling, e.g. when

Heh, it's a generic "problem" for faults in general.  E.g. modern x86 CPUs will
take "spurious" page faults on write accesses if a PTE is writable in memory but
the CPU has a read-only mapping cached in its TLB.

It's all a matter of cost.  E.g. pre-Nehalem Intel CPUs didn't take such spurious
read-only faults as they would re-walk the in-memory page tables, but that ended
up being a net negative because the cost of re-walking for all read-only faults
outweighed the benefits of avoiding spurious faults in the unlikely scenario the
fault had already been fixed.

For a spurious async #PF + IRQ, the cost could be significant, e.g. due to causing
unwanted context switches in the guest, in addition to the raw overhead of the
faults, interrupts, and exits.

one vCPU exits, userspace handles the fault and races setting the bitmap
with another vCPU that is about to fault the same page, which may cause a
spurious exit.

On the other hand, is it malignant?  The only downside is additional
overhead of the async PF protocol, but if the race occurs infrequently, it
shouldn't be a problem.
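
The benign variant of that race, assuming userspace clears the bit with an
atomic op once the page is resident (names hypothetical):

    /* KVM, vCPU fault path */
    if (test_bit(idx, bitmap))          /* may observe a stale 1 */
            return exit_to_userspace();

    /* userspace, on page arrival */
    install_page(gpa);                  /* page becomes resident */
    clear_bit(idx, bitmap);             /* KVM may already be past its
                                           test_bit() above */

If the stale 1 wins, the exit is spurious: userspace finds the page already
resident and re-enters the guest, i.e. wasted work rather than a correctness
problem.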

When it comes to uAPI, I want to try and avoid statements along the lines of
"IF 'x' holds true, then 'y' SHOULDN'T be a problem".  If this didn't impact uAPI,
I wouldn't care as much, i.e. I'd be much more willing to iterate as needed.

I'm not saying we should go straight for a complex implementation.  Quite the
opposite.  But I do want us to consider the possible ramifications of using a
single bit for all userfaults, so that we can at least try to design something
that is extensible and won't be a pain to maintain.

So you would have preferred the "two-bit per gfn" approach, i.e. provide two
interception points, for sync and async exits, with the former chosen by
userspace when it "knows" that the content is already in memory?  What makes
it a conundrum then?  It looks like an incremental change to what has already
been proposed.  There is the complication that two-bit operations aren't
atomic, but even one bit is racy between KVM and userspace.
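
For reference, one way a two-bit-per-gfn encoding could look, keeping both
bits of a gfn in the same word so a single READ_ONCE() yields a consistent
pair (encoding and names hypothetical):

    #define UF_SYNC   BIT(0)  /* exit; userspace resolves synchronously */
    #define UF_ASYNC  BIT(1)  /* exit; KVM injects not-present #PF first */

    static unsigned int gfn_userfault_flags(unsigned long *bitmap,
                                            unsigned long idx)
    {
            unsigned long word = READ_ONCE(bitmap[idx * 2 / BITS_PER_LONG]);

            return (word >> (idx * 2 % BITS_PER_LONG)) & (UF_SYNC | UF_ASYNC);
    }

Reading the pair is atomic this way; flipping both bits together would still
need a cmpxchg loop, which is the non-atomicity mentioned above.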



