On Fri, Apr 22, 2022 at 06:56:12PM +0800, Chao Peng wrote: > Great thanks for the discussions. I summarized the requirements/gaps and the > potential changes for next step. Please help to review. Hi Chao, Thanks for writing this up. I've been meaning to respond, but wanted to make a bit more progress with SNP+UPM prototype to get a better idea of what's needed on that end. I've needed to make some changes on the KVM and QEMU side to get things working so hopefully with your proposed rework those changes can be dropped. > > > Terminologies: > -------------- > - memory conversion: the action of converting guest memory between private > and shared. > - explicit conversion: an enlightened guest uses a hypercall to explicitly > request a memory conversion to VMM. > - implicit conversion: the conversion when VMM reacts to a page fault due > to different guest/host memory attributes (private/shared). > - destructive conversion: the memory content is lost/destroyed during > conversion. > - non-destructive conversion: the memory content is preserved during > conversion. > > > Requirements & Gaps > ------------------------------------- > - Confidential computing(CC): TDX/SEV/CCA > * Need support both explicit/implicit conversions. > * Need support only destructive conversion at runtime. > * The current patch should just work, but prefer to have pre-boot guest > payload/firmware population into private memory for performance. Not just performance in the case of SEV, it's needed there because firmware only supports in-place encryption of guest memory, there's no mechanism to provide a separate buffer to load into guest memory at pre-boot time. I think you're aware of this but wanted to point that out just in case. > > - pKVM > * Support explicit conversion only. Hard to achieve implicit conversion, > does not record the guest access info (private/shared) in page fault, > also makes little sense. > * Expect to support non-destructive conversion at runtime. Additionally > in-place conversion (the underlying physical page is unchanged) is > desirable since copy is not disirable. The current destructive conversion > does not fit well. > * The current callbacks between mm/KVM is useful and reusable for pKVM. > * Pre-boot guest payload population is nice to have. > > > Change Proposal > --------------- > Since there are some divergences for pKVM from CC usages and at this time looks > whether we will and how we will support pKVM with this private memory patchset > is still not quite clear, so this proposal does not imply certain detailed pKVM > implementation. But from the API level, we want this can be possible to be future > extended for pKVM or other potential usages. > > - No new user APIs introduced for memory backing store, e.g. remove the > current MFD_INACCESSIBLE. This info will be communicated from memfile_notifier > consumers to backing store via the new 'flag' field in memfile_notifier > described below. At creation time, the fd is normal shared fd. At rumtime CC > usages will keep using current fallocate/FALLOC_FL_PUNCH_HOLE to do the > conversion, but pKVM may also possible use a different way (e.g. rely on > mmap/munmap or mprotect as discussed). These are all not new APIs anyway. For SNP most of the explicit conversions are via GHCB page-state change requests. Each of these PSC requests can request shared/private conversions for up to 252 individual pages, along with whether or not they should be treated as 4K or 2M pages. Currently, like with KVM_EXIT_MEMORY_ERROR, these requests get handled in userspace and call back into the kernel via fallocate/PUNCH_HOLE calls. For each fallocate(), we need to update the RMP table to mark a page as private, and for PUNCH_HOLE we need to mark it as shared (otherwise it would be freed back to the host as guest-owned/private and cause a crash if the host tries to re-use it for something). I needed to add some callbacks to the memfile_notifier to handle these RMP table updates. There might be some other bits of book-keeping like clflush's, and adding/removing guest pages from the kernel's direct map. Not currently implemented, but the guest can also issue requests to "smash"/"unsmash" a 2M private range into individual 4K private ranges (generally in advance of flipping one of the pages to shared, or vice-versa) in the RMP table. Hypervisor code tries to handle this automatically, by determining when to smash/unsmash on it's own, but... I'm wondering how all these things can be properly conveyed through this fallocate/PUNCH_HOLE interface if we ever needed to add support for all of this, as it seems a bit restrictive as-is. For instance, with the current approach, one possible scheme is: - explicit conversion of shared->private for 252 4K pages: - we could do 252 individual fallocate()'s of 4K each, and make sure the kernel code will do notifier callbacks / RMP updates for each individual 4K page - shared->private for 252 2M pages: - we could do 252 individual fallocate()'s of 2M each, and make sure the kernel code will do notifier callbacks / RMP updates for each individual 2M page But for SNP most of these bulk PSC changes are when the guest switches *all* of it's pages from shared->private during early boot when it validates all of it's memory. So these pages tend to be contiguous ranges, and a nice optimization would be to coalesce these 252 fallocate() calls into a single fallocate() that spans the whole range. But there's no good way to do that without losing information like whether these should be treated as individual 4K vs. 2M ranges. So I wonder, since there's talk of the "binding" of this memfd to KVM being what actually enabled all the private/shared operations, if we should introduce some sort of new KVM ioctl, like KVM_UPM_SET_PRIVATE/SHARED, that could handle all the fallocate/hole-punching on the kernel side for larger GFN ranges to reduce the kernel<->userspace transitions, and allow for 4K/2M granularity to be specified as arguments, and maybe provide for better backward-compatibility vs. future changes to memfd backend interface. > > - Add a flag to memfile_notifier so its consumers can state the requirements. > > struct memfile_notifier { > struct list_head list; > unsigned long flags; /* consumer states its requirements here */ > struct memfile_notifier_ops *ops; /* future function may also extend ops when necessary */ > }; > > For current CC usage, we can define and set below flags from KVM. > > /* memfile notifier flags */ > #define MFN_F_USER_INACCESSIBLE 0x0001 /* memory allocated in the file is inaccessible from userspace (e.g. read/write/mmap) */ > #define MFN_F_UNMOVABLE 0x0002 /* memory allocated in the file is unmovable */ > #define MFN_F_UNRECLAIMABLE 0x0003 /* memory allocated in the file is unreclaimable (e.g. via kswapd or any other pathes) */ > > When memfile_notifier is being registered, memfile_register_notifier will > need check these flags. E.g. for MFN_F_USER_INACCESSIBLE, it fails when > previous mmap-ed mapping exists on the fd (I'm still unclear on how to do > this). When multiple consumers are supported it also need check all > registered consumers to see if any conflict (e.g. all consumers should have > MFN_F_USER_INACCESSIBLE set). Only when the register succeeds, the fd is > converted into a private fd, before that, the fd is just a normal (shared) > one. During this conversion, the previous data is preserved so you can put > some initial data in guest pages (whether the architecture allows this is > architecture-specific and out of the scope of this patch). > > - Pre-boot guest payload populating is done by normal mmap/munmap on the fd > before it's converted into private fd when KVM registers itself to the > backing store. Is that registration still intended to be triggered by KVM_SET_USER_MEMORY_REGION, or is there a new ioctl you're considering? I ask because in the case of SNP (and QEMU in general, maybe other VMMs), the regions are generally registered before the guest contents are initialized. So if KVM_SET_USER_MEMORY_REGION kicks of the conversion then it's too late for the SNP code in QEMU to populate the pre-conversion data. Maybe, building on the above approach, we could have something like: KVM_SET_USER_MEMORY_REGION KVM_UPM_BIND(TYPE_TDX|SEV|SNP, gfn_start, gfn_end) <populate guest memory> KVM_UPM_INIT(gfn_start, gfn_end) //not sure if needed KVM_UPM_SET_PRIVATE(gfn_start, gfn_end, granularity) <launch guest> ... KVM_UPM_SET_PRIVATE(gfn_start, gfn_end, granularity) ... KVM_UPM_SET_SHARED(gfn_start, gfn_end, granularity) etc. Just some rough ideas, but I think addressing these in some form would help a lot with getting SNP covered with reasonable performance. > > - Implicit conversion: maybe it's worthy to discuss again: how about totally > remove implicit converion support? TDX should be OK, unsure SEV/CCA. pKVM > should be happy to see. Removing also makes the work much easier and prevents > guest bugs/unitended behaviors early. If it turns out that there is reason to > keep it, then for pKVM we can make it an optional feature (e.g. via a new > module param). But that can be added when pKVM really gets supported. SEV sort of relies on implicit conversion since the guest is free to turn on/off the encryption bit during run-time. But in the context of UPM that wouldn't be supported anyway since, IIUC, the idea is that SEV/SEV-ES would only be supported for guests that do explicit conversions via MAP_GPA_RANGE hypercall. And for SNP these would similarly be done via explicit page-state change requests via GHCB requests issued by the guest. But if possible, it would be nice if we could leave implicit conversion as an optional feature/flag, as it's something that we considered harmless for the guest SNP support (now upstream), and planned to allow in the hypervisor implementation. I don't think we intentionally relied on it in the guest kernel/uefi support, but I need to audit that code to be sure that dropping it wouldn't cause a regression in the guest support. I'll try to confirm this soon one I get things running under UPM a bit more reliably. > > - non-destructive in-place conversion: Out of scope for this series, pKVM can > invent other pKVM specific interfaces (either extend memfile_notifier and using > mmap/mprotect or use totaly different ways like access through vmfd as Sean > suggested). > > Thanks, > Chao Also, happy to help with testing things on the SNP side going forward, just let me know. Thanks! -Mike