Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap

Sean Christopherson <seanjc@xxxxxxxxxx> · Tue, 8 Oct 2024 12:56:45 -0700

On Tue, Oct 08, 2024, Ackerley Tng wrote:
> Patrick Roy <roypat@xxxxxxxxxxxx> writes:
> > For the "non-CoCo with direct map entries removed" VMs that we at AWS
> > are going for, we'd like a VM type with host-controlled in-place
> > conversions which doesn't zero on transitions,

Hmm, your use case shouldn't need conversions _for KVM_, as there's no need for
KVM to care if userspace or the guest _wants_ a page to be shared vs. private.
Userspace is fully trusted to manage things; KVM simply reacts to the current
state of things.

And more importantly, whether or not the direct map is zapped needs to be a
property of the guest_memfd inode, i.e. can't be associated with a struct kvm.
I forget who got volunteered to do the work, but we're going to need similar
functionality for tracking the state of individual pages in a huge folio, as
folio_mark_uptodate() is too coarse-grained.  I.e. at some point, I expect that
guest_memfd will make it easy-ish to determine whether or not the direct map has
been obliterated.
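
To illustrate the granularity problem (purely a sketch, not what the eventual
guest_memfd code will look like): a single per-folio flag can only say "every
page is done", whereas a per-page bitmap can describe a partially converted
huge folio.  The struct, the helper names, and the 512-page folio size below
are all hypothetical.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: a 2MiB folio made of 4KiB pages. */
#define PAGES_PER_HUGE_FOLIO	512u
#define BITS_PER_WORD		(sizeof(unsigned long) * CHAR_BIT)
#define STATE_WORDS \
	((PAGES_PER_HUGE_FOLIO + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Hypothetical per-folio tracking: one state bit per 4KiB page. */
struct gmem_folio_state {
	unsigned long done[STATE_WORDS];
};

/* Record that the page at index idx within the folio reached the state. */
static void gmem_mark_page(struct gmem_folio_state *s, unsigned int idx)
{
	s->done[idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
}

/* Query a single page, which a per-folio flag cannot express. */
static bool gmem_page_marked(const struct gmem_folio_state *s,
			     unsigned int idx)
{
	return s->done[idx / BITS_PER_WORD] & (1UL << (idx % BITS_PER_WORD));
}

/* The only question a single per-folio flag is able to answer. */
static bool gmem_all_pages_marked(const struct gmem_folio_state *s)
{
	unsigned int i;

	for (i = 0; i < PAGES_PER_HUGE_FOLIO; i++) {
		if (!gmem_page_marked(s, i))
			return false;
	}
	return true;
}
```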

The shared vs. private attributes tracking in KVM is still needed (I think), as
it communicates what userspace _wants_, whereas the guest_memfd machinery will
track what the state _is_.

> > so if KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new
> > VM type for that.

Maybe we should sneak in a s/KVM_X86_SW_PROTECTED_VM/KVM_X86_SW_HARDENED_VM rename?
The original thought behind "software protected VM" was to do a slow build of
something akin to pKVM, but realistically I don't think that idea is going anywhere.

Alternatively, depending on how KVM accesses guest memory that's been removed from
the direct map, another solution would be to allow "regular" VMs to bind memslots
to guest_memfd, i.e. if the non-CoCo use case needs/wants to bind all memory to
guest_memfd, not just "private" mappings.

That's probably the biggest topic of discussion: how do we want to allow mapping
guest_memfd into the guest, without direct map entries, but while still allowing
KVM to access guest memory as needed, e.g. for shadow paging.  One approach is
your RFC, where KVM maps guest_memfd pfns on-demand.

Another (slightly crazy) approach would be to use protection keys to provide the
security properties that you want, while giving KVM (and userspace) a quick-and-easy
override to access guest memory.

 1. mmap() guest_memfd into userspace with RW protections
 2. Configure PKRU to make guest_memfd memory inaccessible by default
 3. Swizzle PKRU on-demand when intentionally accessing guest memory

It's essentially the same idea as SMAP+STAC/CLAC, just applied to guest memory
instead of to userspace memory.
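
A rough model of steps 2 and 3, to make the "swizzle" concrete.  This is only a
sketch under stated assumptions: PKRU is modeled as a plain variable instead of
being read/written with RDPKRU/WRPKRU, and GMEM_PKEY plus all helper names are
made up for illustration.  The one hard fact it relies on is the PKRU layout,
i.e. two bits per key, access-disable in bit 2*key and write-disable in bit
2*key + 1.

```c
#include <assert.h>
#include <stdint.h>

#define PKRU_AD		0x1u	/* access disable */
#define PKRU_WD		0x2u	/* write disable */
#define NR_PKEYS	16u	/* x86 provides 16 protection keys */

/* Hypothetical key assumed to tag the guest_memfd mapping from step 1. */
#define GMEM_PKEY	1u

/* Software stand-in for the per-CPU PKRU register (assumption). */
static uint32_t model_pkru;

/* Return pkru with all access to key disabled. */
static uint32_t pkru_deny(uint32_t pkru, unsigned int key)
{
	assert(key < NR_PKEYS);
	return pkru | ((PKRU_AD | PKRU_WD) << (2 * key));
}

/* Return pkru with read and write access to key enabled. */
static uint32_t pkru_allow(uint32_t pkru, unsigned int key)
{
	assert(key < NR_PKEYS);
	return pkru & ~((PKRU_AD | PKRU_WD) << (2 * key));
}

/* Would a data read through a mapping tagged with key succeed? */
static int pkru_readable(uint32_t pkru, unsigned int key)
{
	return !(pkru & (PKRU_AD << (2 * key)));
}

/* Step 2: guest memory is inaccessible by default. */
static void gmem_lock_by_default(void)
{
	model_pkru = pkru_deny(model_pkru, GMEM_PKEY);
}

/*
 * Step 3: open a window, do the access, restore the previous PKRU value.
 * Returns 0 on success, -1 if the modeled access would have faulted.
 */
static int gmem_read_byte(const uint8_t *gmem, unsigned long off,
			  uint8_t *out)
{
	uint32_t saved = model_pkru;
	int ret = -1;

	model_pkru = pkru_allow(saved, GMEM_PKEY);
	if (pkru_readable(model_pkru, GMEM_PKEY)) {
		*out = gmem[off];
		ret = 0;
	}
	model_pkru = saved;
	return ret;
}
```

In real userspace code the modeled register would presumably be replaced by
pkey_alloc(), pkey_mprotect() and pkey_set(); no page tables are touched when
opening or closing the window, only the per-CPU register value changes.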

The benefit of the PKRU approach is that there are no PTE modifications, and thus
no TLB flushes, and only the CPU that is accessing guest memory gains temporary
access.  The big downside is that it would be limited to modern hardware, but
that might be acceptable, especially if it simplifies KVM's implementation.

> > Somewhat related sidenote: For VMs that allow inplace conversions and do
> > not zero, we do not need to zap the stage-2 mappings on memory attribute
> > changes, right?

See above.  I don't think conversions by toggling the shared/private flag in
KVM's memory attributes is the right fit for your use case.