Re: [PATCH v5 00/13] KVM: mm: fd-based approach for supporting KVM guest private memory

Michael Roth <michael.roth@xxxxxxx> · Mon, 9 May 2022 17:30:56 -0500

On Fri, Apr 22, 2022 at 06:56:12PM +0800, Chao Peng wrote:
> Great thanks for the discussions. I summarized the requirements/gaps and the
> potential changes for next step. Please help to review.

Hi Chao,

Thanks for writing this up. I've been meaning to respond, but wanted to
make a bit more progress with SNP+UPM prototype to get a better idea of
what's needed on that end. I've needed to make some changes on the KVM
and QEMU side to get things working so hopefully with your proposed
rework those changes can be dropped.

> 
> 
> Terminologies:
> --------------
>   - memory conversion: the action of converting guest memory between private
>     and shared.
>   - explicit conversion: an enlightened guest uses a hypercall to explicitly
>     request a memory conversion to VMM.
>   - implicit conversion: the conversion when VMM reacts to a page fault due
>     to different guest/host memory attributes (private/shared).
>   - destructive conversion: the memory content is lost/destroyed during
>     conversion.
>   - non-destructive conversion: the memory content is preserved during
>     conversion.
> 
> 
> Requirements & Gaps
> -------------------------------------
>   - Confidential computing(CC): TDX/SEV/CCA
>     * Need support both explicit/implicit conversions.
>     * Need support only destructive conversion at runtime.
>     * The current patch should just work, but prefer to have pre-boot guest
>       payload/firmware population into private memory for performance.

Not just performance in the case of SEV, it's needed there because firmware
only supports in-place encryption of guest memory, there's no mechanism to
provide a separate buffer to load into guest memory at pre-boot time. I
think you're aware of this but wanted to point that out just in case.

> 
>   - pKVM
>     * Support explicit conversion only. Hard to achieve implicit conversion,
>       does not record the guest access info (private/shared) in page fault,
>       also makes little sense.
>     * Expect to support non-destructive conversion at runtime. Additionally
>       in-place conversion (the underlying physical page is unchanged) is
>       desirable since copy is not disirable. The current destructive conversion
>       does not fit well.
>     * The current callbacks between mm/KVM is useful and reusable for pKVM.
>     * Pre-boot guest payload population is nice to have.
> 
> 
> Change Proposal
> ---------------
> Since there are some divergences for pKVM from CC usages and at this time looks
> whether we will and how we will support pKVM with this private memory patchset
> is still not quite clear, so this proposal does not imply certain detailed pKVM
> implementation. But from the API level, we want this can be possible to be future
> extended for pKVM or other potential usages.
> 
>   - No new user APIs introduced for memory backing store, e.g. remove the
>     current MFD_INACCESSIBLE. This info will be communicated from memfile_notifier
>     consumers to backing store via the new 'flag' field in memfile_notifier
>     described below. At creation time, the fd is normal shared fd. At rumtime CC
>     usages will keep using current fallocate/FALLOC_FL_PUNCH_HOLE to do the
>     conversion, but pKVM may also possible use a different way (e.g. rely on
>     mmap/munmap or mprotect as discussed). These are all not new APIs anyway.

For SNP most of the explicit conversions are via GHCB page-state change
requests. Each of these PSC requests can request shared/private
conversions for up to 252 individual pages, along with whether or not
they should be treated as 4K or 2M pages. Currently, like with
KVM_EXIT_MEMORY_ERROR, these requests get handled in userspace and call
back into the kernel via fallocate/PUNCH_HOLE calls.

For each fallocate(), we need to update the RMP table to mark a page as
private, and for PUNCH_HOLE we need to mark it as shared (otherwise it
would be freed back to the host as guest-owned/private and cause a crash if
the host tries to re-use it for something). I needed to add some callbacks
to the memfile_notifier to handle these RMP table updates. There might be
some other bits of book-keeping like clflush's, and adding/removing guest
pages from the kernel's direct map.

Not currently implemented, but the guest can also issue requests to
"smash"/"unsmash" a 2M private range into individual 4K private ranges
(generally in advance of flipping one of the pages to shared, or
vice-versa) in the RMP table. Hypervisor code tries to handle this
automatically, by determining when to smash/unsmash on it's own, but...

I'm wondering how all these things can be properly conveyed through this
fallocate/PUNCH_HOLE interface if we ever needed to add support for all
of this, as it seems a bit restrictive as-is. For instance, with the
current approach, one possible scheme is:

  - explicit conversion of shared->private for 252 4K pages:
    - we could do 252 individual fallocate()'s of 4K each, and make sure the
      kernel code will do notifier callbacks / RMP updates for each individual
      4K page

  - shared->private for 252 2M pages:
    - we could do 252 individual fallocate()'s of 2M each, and make sure the
      kernel code will do notifier callbacks / RMP updates for each individual
      2M page

But for SNP most of these bulk PSC changes are when the guest switches
*all* of it's pages from shared->private during early boot when it
validates all of it's memory. So these pages tend to be contiguous
ranges, and a nice optimization would be to coalesce these 252
fallocate() calls into a single fallocate() that spans the whole range.
But there's no good way to do that without losing information like
whether these should be treated as individual 4K vs. 2M ranges.

So I wonder, since there's talk of the "binding" of this memfd to KVM
being what actually enabled all the private/shared operations, if we
should introduce some sort of new KVM ioctl, like
KVM_UPM_SET_PRIVATE/SHARED, that could handle all the
fallocate/hole-punching on the kernel side for larger GFN ranges to reduce
the kernel<->userspace transitions, and allow for 4K/2M granularity to be
specified as arguments, and maybe provide for better
backward-compatibility vs. future changes to memfd backend interface.

> 
>   - Add a flag to memfile_notifier so its consumers can state the requirements.
> 
>         struct memfile_notifier {
>                 struct list_head list;
>                 unsigned long flags;     /* consumer states its requirements here */
>                 struct memfile_notifier_ops *ops; /* future function may also extend ops when necessary */
>         };
> 
>     For current CC usage, we can define and set below flags from KVM.
> 
>         /* memfile notifier flags */
>         #define MFN_F_USER_INACCESSIBLE   0x0001  /* memory allocated in the file is inaccessible from userspace (e.g. read/write/mmap) */
>         #define MFN_F_UNMOVABLE           0x0002  /* memory allocated in the file is unmovable */
>         #define MFN_F_UNRECLAIMABLE       0x0003  /* memory allocated in the file is unreclaimable (e.g. via kswapd or any other pathes) */
> 
>     When memfile_notifier is being registered, memfile_register_notifier will
>     need check these flags. E.g. for MFN_F_USER_INACCESSIBLE, it fails when
>     previous mmap-ed mapping exists on the fd (I'm still unclear on how to do
>     this). When multiple consumers are supported it also need check all
>     registered consumers to see if any conflict (e.g. all consumers should have
>     MFN_F_USER_INACCESSIBLE set). Only when the register succeeds, the fd is
>     converted into a private fd, before that, the fd is just a normal (shared)
>     one. During this conversion, the previous data is preserved so you can put
>     some initial data in guest pages (whether the architecture allows this is
>     architecture-specific and out of the scope of this patch).
> 
>   - Pre-boot guest payload populating is done by normal mmap/munmap on the fd
>     before it's converted into private fd when KVM registers itself to the
>     backing store.

Is that registration still intended to be triggered by
KVM_SET_USER_MEMORY_REGION, or is there a new ioctl you're considering?

I ask because in the case of SNP (and QEMU in general, maybe other VMMs),
the regions are generally registered before the guest contents are
initialized. So if KVM_SET_USER_MEMORY_REGION kicks of the conversion then
it's too late for the SNP code in QEMU to populate the pre-conversion data.

Maybe, building on the above approach, we could have something like:

KVM_SET_USER_MEMORY_REGION
KVM_UPM_BIND(TYPE_TDX|SEV|SNP, gfn_start, gfn_end)
<populate guest memory>
KVM_UPM_INIT(gfn_start, gfn_end) //not sure if needed
KVM_UPM_SET_PRIVATE(gfn_start, gfn_end, granularity)
<launch guest>
...
KVM_UPM_SET_PRIVATE(gfn_start, gfn_end, granularity)
...
KVM_UPM_SET_SHARED(gfn_start, gfn_end, granularity)
etc.

Just some rough ideas, but I think addressing these in some form would help
a lot with getting SNP covered with reasonable performance.

> 
>   - Implicit conversion: maybe it's worthy to discuss again: how about totally
>     remove implicit converion support? TDX should be OK, unsure SEV/CCA. pKVM
>     should be happy to see. Removing also makes the work much easier and prevents
>     guest bugs/unitended behaviors early. If it turns out that there is reason to
>     keep it, then for pKVM we can make it an optional feature (e.g. via a new
>     module param). But that can be added when pKVM really gets supported.

SEV sort of relies on implicit conversion since the guest is free to turn
on/off the encryption bit during run-time. But in the context of UPM that
wouldn't be supported anyway since, IIUC, the idea is that SEV/SEV-ES would
only be supported for guests that do explicit conversions via MAP_GPA_RANGE
hypercall. And for SNP these would similarly be done via explicit page-state
change requests via GHCB requests issued by the guest.

But if possible, it would be nice if we could leave implicit conversion
as an optional feature/flag, as it's something that we considered
harmless for the guest SNP support (now upstream), and planned to allow
in the hypervisor implementation. I don't think we intentionally relied on
it in the guest kernel/uefi support, but I need to audit that code to be
sure that dropping it wouldn't cause a regression in the guest support.
I'll try to confirm this soon one I get things running under UPM a bit more
reliably.

> 
>   - non-destructive in-place conversion: Out of scope for this series, pKVM can
>     invent other pKVM specific interfaces (either extend memfile_notifier and using
>     mmap/mprotect or use totaly different ways like access through vmfd as Sean
>     suggested).
> 
> Thanks,
> Chao

Also, happy to help with testing things on the SNP side going forward, just
let me know.

Thanks!

-Mike