Re: [RFC PATCH 1/8] KVM: Document KVM_MAP_MEMORY ioctl

Michael Roth <michael.roth@xxxxxxx> · Sun, 10 Mar 2024 18:12:26 -0500

On Thu, Mar 07, 2024 at 06:19:41PM -0800, Isaku Yamahata wrote:
> On Thu, Mar 07, 2024 at 05:28:20PM -0800,
> Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> 
> > On Thu, Mar 07, 2024, David Matlack wrote:
> > > On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > > > +:Returns: 0 on success, <0 on error
> > > > > > > +
> > > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > > > +
> > > > > > > +::
> > > > > > > +
> > > > > > > +  struct kvm_memory_mapping {
> > > > > > > +	__u64 base_gfn;
> > > > > > > +	__u64 nr_pages;
> > > > > > > +	__u64 flags;
> > > > > > > +	__u64 source;
> > > > > > > +  };
> > > > > > > +
> > > > > > > +  /* For kvm_memory_mapping:: flags */
> > > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> > > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> > > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
> > > > > > 
> > > > > > I am not sure what's the good of having "FLAG_USER"?
> > > > > > 
> > > > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > > > as user-fault?
> > > > > 
> > > > > The point is how to emulate kvm page fault as if vcpu caused the kvm page
> > > > > fault.  Not we call the ioctl as user context.
> > > > 
> > > > Sorry I don't quite follow.  What's wrong if KVM just append the #PF USER
> > > > error bit before it calls into the fault handler?
> > > > 
> > > > My question is, since this is ABI, you have to tell how userspace is
> > > > supposed to use this.  Maybe I am missing something, but I don't see how
> > > > USER should be used here.
> > > 
> > > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> > > is meaningless, PFERR_USER_MASK is only relevant for shadow paging.
> > 
> > +1
> > 
> > > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> > > populated with writes (which avoids just faulting in the zero-page for
> > > anon or tmpfs backed memslots), while also allowing populating read-only
> > > memslots.
> > > 
> > > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
> > 
> > It would midly be interesting for something like the NX hugepage mitigation.
> > 
> > For the initial implementation, I don't think the ioctl() should specify
> > protections, period.
> > 
> > VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
> > And for guest_memfd, finer grained control in general, and long term compatibility
> > with other features that are in-flight or proposed, I would rather userspace specify
> > RWX protections via KVM_SET_MEMORY_ATTRIBUTES.  Oh, and dirty logging would be a
> > pain too.
> > 
> > KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
> > we can simply define KVM_MAP_MEMORY to behave like a read fault.  E.g. map RX,
> > and add W if all underlying protections allow it.
> > 
> > That way we can defer dealing with things like XO and RW *if* KVM ever does gain
> > support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
> > will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
> > for XO memory.
> > 
> > And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
> > KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
> > RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
> > regions.
> > 
> > We can always expand the uAPI, but taking away functionality is much harder, if
> > not impossible.
> 
> Ok, let me drop all the flags.  Here is the updated one.
> 
> 4.143 KVM_MAP_MEMORY
> ------------------------
> 
> :Capability: KVM_CAP_MAP_MEMORY
> :Architectures: none
> :Type: vcpu ioctl
> :Parameters: struct kvm_memory_mapping(in/out)
> :Returns: 0 on success, < 0 on error
> 
> Errors:
> 
>   ======   =============================================================
>   EINVAL   vcpu state is not in TDP MMU mode or is in guest mode.
>            Currently, this ioctl is restricted to TDP MMU.
>   EAGAIN   The region is only processed partially.  The caller should
>            issue the ioctl with the updated parameters.
>   EINTR    An unmasked signal is pending.  The region may be processed
>            partially.  If `nr_pages` > 0, the caller should issue the
>            ioctl with the updated parameters.
>   ======   =============================================================
> 
> KVM_MAP_MEMORY populates guest memory before the VM starts to run.  Multiple
> vcpus can call this ioctl simultaneously.  It may result in the error of EAGAIN
> due to race conditions.
> 
> ::
> 
>   struct kvm_memory_mapping {
> 	__u64 base_gfn;
> 	__u64 nr_pages;
> 	__u64 flags;
> 	__u64 source;
>   };
> 
> KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
> `nr_pages`) in the underlying mapping. `source` is an optional user pointer.  If
> `source` is not NULL and the underlying technology supports it, the memory
> contents of `source` are copied into the guest memory.  The backend may encrypt
> it.  `flags` must be zero.  It's reserved for future use.
> 
> When the ioctl returns, the input values are updated.  If `nr_pages` is large,
> it may return EAGAIN or EINTR for pending signal and update the values
> (`base_gfn` and `nr_pages`.  `source` if not zero) to point to the remaining
> range.

If this intended to replace SNP_LAUNCH_UPDATE, then to be useable for SNP
guests userspace also needs to pass along the type of page being added,
which are currently defined as:

  #define KVM_SEV_SNP_PAGE_TYPE_NORMAL            0x1
  #define KVM_SEV_SNP_PAGE_TYPE_ZERO              0x3
  #define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED        0x4
  #define KVM_SEV_SNP_PAGE_TYPE_SECRETS           0x5
  #define KVM_SEV_SNP_PAGE_TYPE_CPUID             0x6

So I guess the main question is, do bite the bullet now and introduce
infrastructure for vendor-specific parameters, or should we attempt to define
these as cross-vendor/cross-architecture types and hide the vendor-specific
stuff from userspace?

There are a couple other bits of vendor-specific information that would be
needed to be a total drop-in replacement for SNP_LAUNCH_UPDATE, but I think
these we could can do without:

  sev_fd: handle for /dev/sev which is used to issue SEV firmware calls
          as-needed for various KVM ioctls
          - can likely bind this to SNP context during SNP_LAUNCH_UPDATE and
            avoid needing to pass it in for subsequent calls
  error code: return parameter which passes SEV firmware error codes to
              userspace for informational purposes
              - can probably live without this

-Mike

> 
> -- 
> Isaku Yamahata <isaku.yamahata@xxxxxxxxxxxxxxx>