On Thu, Mar 07, 2024 at 06:19:41PM -0800, Isaku Yamahata wrote:
> On Thu, Mar 07, 2024 at 05:28:20PM -0800,
> Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> > On Thu, Mar 07, 2024, David Matlack wrote:
> > > On 2024-03-08 01:20 PM, Huang, Kai wrote:
> > > > > > > +:Parameters: struct kvm_memory_mapping(in/out)
> > > > > > > +:Returns: 0 on success, <0 on error
> > > > > > > +
> > > > > > > +KVM_MAP_MEMORY populates guest memory without running vcpu.
> > > > > > > +
> > > > > > > +::
> > > > > > > +
> > > > > > > +  struct kvm_memory_mapping {
> > > > > > > +        __u64 base_gfn;
> > > > > > > +        __u64 nr_pages;
> > > > > > > +        __u64 flags;
> > > > > > > +        __u64 source;
> > > > > > > +  };
> > > > > > > +
> > > > > > > +  /* For kvm_memory_mapping:: flags */
> > > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_WRITE         _BITULL(0)
> > > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_EXEC          _BITULL(1)
> > > > > > > +  #define KVM_MEMORY_MAPPING_FLAG_USER          _BITULL(2)
> > > > > >
> > > > > > I am not sure what's the good of having "FLAG_USER"?
> > > > > >
> > > > > > This ioctl is called from userspace, thus I think we can just treat this always
> > > > > > as user-fault?
> > > > >
> > > > > The point is how to emulate kvm page fault as if vcpu caused the kvm page
> > > > > fault.  Not we call the ioctl as user context.
> > > >
> > > > Sorry I don't quite follow.  What's wrong if KVM just append the #PF USER
> > > > error bit before it calls into the fault handler?
> > > >
> > > > My question is, since this is ABI, you have to tell how userspace is
> > > > supposed to use this.  Maybe I am missing something, but I don't see how
> > > > USER should be used here.
> > >
> > > If we restrict this API to the TDP MMU then KVM_MEMORY_MAPPING_FLAG_USER
> > > is meaningless, PFERR_USER_MASK is only relevant for shadow paging.
> >
> > +1
> >
> > > KVM_MEMORY_MAPPING_FLAG_WRITE seems useful to allow memslots to be
> > > populated with writes (which avoids just faulting in the zero-page for
> > > anon or tmpfs backed memslots), while also allowing populating read-only
> > > memslots.
> > >
> > > I don't really see a use-case for KVM_MEMORY_MAPPING_FLAG_EXEC.
> >
> > It would mildly be interesting for something like the NX hugepage mitigation.
> >
> > For the initial implementation, I don't think the ioctl() should specify
> > protections, period.
> >
> > VMA-based mappings, i.e. !guest_memfd, already have a way to specify protections.
> > And for guest_memfd, finer grained control in general, and long term compatibility
> > with other features that are in-flight or proposed, I would rather userspace specify
> > RWX protections via KVM_SET_MEMORY_ATTRIBUTES.  Oh, and dirty logging would be a
> > pain too.
> >
> > KVM doesn't currently support execute-only (XO) or !executable (RW), so I think
> > we can simply define KVM_MAP_MEMORY to behave like a read fault.  E.g. map RX,
> > and add W if all underlying protections allow it.
> >
> > That way we can defer dealing with things like XO and RW *if* KVM ever does gain
> > support for specifying those combinations via KVM_SET_MEMORY_ATTRIBUTES, which
> > will likely be per-arch/vendor and non-trivial, e.g. AMD's NPT doesn't even allow
> > for XO memory.
> >
> > And we shouldn't need to do anything for KVM_MAP_MEMORY in particular if
> > KVM_SET_MEMORY_ATTRIBUTES gains support for RWX protections the existing RWX and
> > RX combinations, e.g. if there's a use-case for write-protecting guest_memfd
> > regions.
> >
> > We can always expand the uAPI, but taking away functionality is much harder, if
> > not impossible.
>
> Ok, let me drop all the flags.  Here is the updated one.
>
> 4.143 KVM_MAP_MEMORY
> ------------------------
>
> :Capability: KVM_CAP_MAP_MEMORY
> :Architectures: none
> :Type: vcpu ioctl
> :Parameters: struct kvm_memory_mapping(in/out)
> :Returns: 0 on success, < 0 on error
>
> Errors:
>
>   ====== =============================================================
>   EINVAL vcpu state is not in TDP MMU mode or is in guest mode.
>          Currently, this ioctl is restricted to TDP MMU.
>   EAGAIN The region is only processed partially.  The caller should
>          issue the ioctl with the updated parameters.
>   EINTR  An unmasked signal is pending.  The region may be processed
>          partially.  If `nr_pages` > 0, the caller should issue the
>          ioctl with the updated parameters.
>   ====== =============================================================
>
> KVM_MAP_MEMORY populates guest memory before the VM starts to run.  Multiple
> vcpus can call this ioctl simultaneously.  It may result in the error of EAGAIN
> due to race conditions.
>
> ::
>
>   struct kvm_memory_mapping {
>         __u64 base_gfn;
>         __u64 nr_pages;
>         __u64 flags;
>         __u64 source;
>   };
>
> KVM_MAP_MEMORY populates guest memory at the specified range (`base_gfn`,
> `nr_pages`) in the underlying mapping. `source` is an optional user pointer.  If
> `source` is not NULL and the underlying technology supports it, the memory
> contents of `source` are copied into the guest memory.  The backend may encrypt
> it.  `flags` must be zero.  It's reserved for future use.
>
> When the ioctl returns, the input values are updated.  If `nr_pages` is large,
> it may return EAGAIN or EINTR for pending signal and update the values
> (`base_gfn` and `nr_pages`. `source` if not zero) to point to the remaining
> range.

If this is intended to replace SNP_LAUNCH_UPDATE, then to be usable for SNP
guests userspace also needs to pass along the type of page being added.  The
types are currently defined as:

  #define KVM_SEV_SNP_PAGE_TYPE_NORMAL            0x1
  #define KVM_SEV_SNP_PAGE_TYPE_ZERO              0x3
  #define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED        0x4
  #define KVM_SEV_SNP_PAGE_TYPE_SECRETS           0x5
  #define KVM_SEV_SNP_PAGE_TYPE_CPUID             0x6

So I guess the main question is, do we bite the bullet now and introduce
infrastructure for vendor-specific parameters, or should we attempt to define
these as cross-vendor/cross-architecture types and hide the vendor-specific
stuff from userspace?

There are a couple other bits of vendor-specific information that would be
needed to be a total drop-in replacement for SNP_LAUNCH_UPDATE, but I think
these we can do without:

  sev_fd: handle for /dev/sev which is used to issue SEV firmware calls
          as-needed for various KVM ioctls
    - can likely bind this to SNP context during SNP_LAUNCH_UPDATE and avoid
      needing to pass it in for subsequent calls

  error code: return parameter which passes SEV firmware error codes to
              userspace for informational purposes
    - can probably live without this

-Mike

>
> --
> Isaku Yamahata <isaku.yamahata@xxxxxxxxxxxxxxx>
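
For reference, the second option raised above (cross-vendor page types with
the vendor-specific values hidden from userspace) could look roughly like the
sketch below.  The SNP values are the ones listed in this mail; the
`generic_page_type` enum and the `snp_page_type` helper are hypothetical and
not part of any posted patch.

```c
#include <stdint.h>

/* Values as listed above for SNP_LAUNCH_UPDATE. */
#define KVM_SEV_SNP_PAGE_TYPE_NORMAL		0x1
#define KVM_SEV_SNP_PAGE_TYPE_ZERO		0x3
#define KVM_SEV_SNP_PAGE_TYPE_UNMEASURED	0x4
#define KVM_SEV_SNP_PAGE_TYPE_SECRETS		0x5
#define KVM_SEV_SNP_PAGE_TYPE_CPUID		0x6

/* Hypothetical cross-vendor page types that userspace would pass in. */
enum generic_page_type {
	GENERIC_PAGE_NORMAL,
	GENERIC_PAGE_ZERO,
	GENERIC_PAGE_UNMEASURED,
	GENERIC_PAGE_SECRETS,
	GENERIC_PAGE_CPUID,
};

/*
 * Translate a generic type into the SNP value.  Returns -1 if the generic
 * type has no SNP equivalent, so the caller can reject the request.
 */
static int snp_page_type(enum generic_page_type type)
{
	switch (type) {
	case GENERIC_PAGE_NORMAL:
		return KVM_SEV_SNP_PAGE_TYPE_NORMAL;
	case GENERIC_PAGE_ZERO:
		return KVM_SEV_SNP_PAGE_TYPE_ZERO;
	case GENERIC_PAGE_UNMEASURED:
		return KVM_SEV_SNP_PAGE_TYPE_UNMEASURED;
	case GENERIC_PAGE_SECRETS:
		return KVM_SEV_SNP_PAGE_TYPE_SECRETS;
	case GENERIC_PAGE_CPUID:
		return KVM_SEV_SNP_PAGE_TYPE_CPUID;
	}
	return -1;
}
```

Whether types such as SECRETS or CPUID make sense for other vendors is
exactly the open question; a vendor without an equivalent would return an
error from its own translation helper.
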