Hi, On Tue, Feb 14, 2023 at 9:26 PM Elliot Berman <quic_eberman@xxxxxxxxxxx> wrote: > > > When launching a virtual machine, Gunyah userspace allocates memory for > the guest and informs Gunyah about these memory regions through > SET_USER_MEMORY_REGION ioctl. I'm working on pKVM [1], and regarding the problem of donating private memory to a guest, we and others working on confidential computing have faced a similar issue that this patch is trying to address. In pKVM, we've initially taken an approach similar to the one here by pinning the pages being donated to prevent swapping or migration [2]. However, we've encountered issues with this approach since the memory is still mapped by the host, which could cause the system to crash on an errant access. Instead, we've been working on adopting an fd-based restricted memory approach that was initially proposed for TDX [3] and is now being considered by others in the confidential computing space as well (e.g., Arm CCA [4]). The basic idea is that the host manages the guest memory via a file descriptor instead of a userspace address. It cannot map that memory (unless explicitly shared by the guest [5]), eliminating the possibility of the host trying to access private memory accidentally or being tricked by a malicious actor. This is based on memfd with some restrictions. It handles swapping and migration by disallowing them (for now [6]), and adds a new type of memory region to KVM to accommodate having an fd representing guest memory. Although the fd-based restricted memory isn't upstream yet, we've ported the latest patches to arm64 and made changes and additions to make it work with pKVM, to test it and see if the solution is feasible for us (it is). I wanted to mention this work in case you find it useful, and in the hopes that we can all work on confidential computing using the same interfaces as much as possible. Some comments inline below... Cheers, /fuad [1] https://lore.kernel.org/kvmarm/20220519134204.5379-1-will@xxxxxxxxxx/ [2] https://lore.kernel.org/kvmarm/20220519134204.5379-34-will@xxxxxxxxxx/ [3] https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@xxxxxxxxxxxxxxx/ [4] https://lore.kernel.org/lkml/20230127112932.38045-1-steven.price@xxxxxxx/ [5] This is a modification we've done for the arm64 port, after discussing it with the original authors. [6] Nothing inherent in the proposal to stop migration and swapping. There are some technical issues that need to be resolved. <snip> > +int gh_vm_mem_alloc(struct gh_vm *ghvm, struct gh_userspace_memory_region *region) > +{ > + struct gh_vm_mem *mapping, *tmp_mapping; > + struct gh_rm_mem_entry *mem_entries; > + phys_addr_t curr_page, prev_page; > + struct gh_rm_mem_parcel *parcel; > + int i, j, pinned, ret = 0; > + size_t entry_size; > + u16 vmid; > + > + if (!gh_api_has_feature(GH_API_FEATURE_MEMEXTENT)) > + return -EOPNOTSUPP; > + > + if (!region->memory_size || !PAGE_ALIGNED(region->memory_size) || > + !PAGE_ALIGNED(region->userspace_addr) || !PAGE_ALIGNED(region->guest_phys_addr)) > + return -EINVAL; > + > + ret = mutex_lock_interruptible(&ghvm->mm_lock); > + if (ret) > + return ret; > + mapping = __gh_vm_mem_find(ghvm, region->label); > + if (mapping) { > + mutex_unlock(&ghvm->mm_lock); > + return -EEXIST; > + } > + > + mapping = kzalloc(sizeof(*mapping), GFP_KERNEL); > + if (!mapping) { > + ret = -ENOMEM; > + goto free_mapping; > + } > + > + mapping->parcel.label = region->label; > + mapping->guest_phys_addr = region->guest_phys_addr; > + mapping->npages = region->memory_size >> PAGE_SHIFT; > + parcel = &mapping->parcel; > + parcel->mem_handle = GH_MEM_HANDLE_INVAL; /* to be filled later by mem_share/mem_lend */ > + parcel->mem_type = GH_RM_MEM_TYPE_NORMAL; > + > + /* Check for overlap */ > + list_for_each_entry(tmp_mapping, &ghvm->memory_mappings, list) { > + if (!((mapping->guest_phys_addr + (mapping->npages << PAGE_SHIFT) <= > + tmp_mapping->guest_phys_addr) || > + (mapping->guest_phys_addr >= > + tmp_mapping->guest_phys_addr + (tmp_mapping->npages << PAGE_SHIFT)))) { > + ret = -EEXIST; > + goto free_mapping; > + } > + } > + > + list_add(&mapping->list, &ghvm->memory_mappings); > + > + mapping->pages = kcalloc(mapping->npages, sizeof(*mapping->pages), GFP_KERNEL); > + if (!mapping->pages) { > + ret = -ENOMEM; > + mapping->npages = 0; /* update npages for reclaim */ > + goto reclaim; > + } These pages should be accounted for as locked pages, e.g., account_locked_vm(), which would also ensure that the process hasn't reached its limit. > + pinned = pin_user_pages_fast(region->userspace_addr, mapping->npages, > + FOLL_WRITE | FOLL_LONGTERM, mapping->pages); It might be good to check and avoid donating pages with pre-faulted file mappings since it might trigger a writeback of a page after losing access to it. Ideally, you want to only accept anonymous or shmem pages. In pKVM, we check that the pages are SwapBacked and reject the pinning/donation otherwise [2]. > + if (pinned < 0) { > + ret = pinned; > + mapping->npages = 0; /* update npages for reclaim */ > + goto reclaim; > + } else if (pinned != mapping->npages) { > + ret = -EFAULT; > + mapping->npages = pinned; /* update npages for reclaim */ > + goto reclaim; > + } > + > + if (region->flags & GH_MEM_LENT) { > + parcel->n_acl_entries = 1; > + mapping->share_type = VM_MEM_LEND; > + } else { > + parcel->n_acl_entries = 2; > + mapping->share_type = VM_MEM_SHARE; > + } > + parcel->acl_entries = kcalloc(parcel->n_acl_entries, sizeof(*parcel->acl_entries), > + GFP_KERNEL); > + if (!parcel->acl_entries) { > + ret = -ENOMEM; > + goto reclaim; > + } > + > + parcel->acl_entries[0].vmid = cpu_to_le16(ghvm->vmid); > + if (region->flags & GH_MEM_ALLOW_READ) > + parcel->acl_entries[0].perms |= GH_RM_ACL_R; > + if (region->flags & GH_MEM_ALLOW_WRITE) > + parcel->acl_entries[0].perms |= GH_RM_ACL_W; > + if (region->flags & GH_MEM_ALLOW_EXEC) > + parcel->acl_entries[0].perms |= GH_RM_ACL_X; > + > + if (mapping->share_type == VM_MEM_SHARE) { > + ret = gh_rm_get_vmid(ghvm->rm, &vmid); > + if (ret) > + goto reclaim; > + > + parcel->acl_entries[1].vmid = cpu_to_le16(vmid); > + /* Host assumed to have all these permissions. Gunyah will not > + * grant new permissions if host actually had less than RWX > + */ > + parcel->acl_entries[1].perms |= GH_RM_ACL_R | GH_RM_ACL_W | GH_RM_ACL_X; > + } > + > + mem_entries = kcalloc(mapping->npages, sizeof(*mem_entries), GFP_KERNEL); > + if (!mem_entries) { > + ret = -ENOMEM; > + goto reclaim; > + } > + > + /* reduce number of entries by combining contiguous pages into single memory entry */ > + prev_page = page_to_phys(mapping->pages[0]); > + mem_entries[0].ipa_base = cpu_to_le64(prev_page); > + entry_size = PAGE_SIZE; > + for (i = 1, j = 0; i < mapping->npages; i++) { > + curr_page = page_to_phys(mapping->pages[i]); > + if (page_contiguous(prev_page, curr_page)) { > + entry_size += PAGE_SIZE; > + } else { > + mem_entries[j].size = cpu_to_le64(entry_size); > + j++; > + mem_entries[j].ipa_base = cpu_to_le64(curr_page); > + entry_size = PAGE_SIZE; > + } > + > + prev_page = curr_page; > + } > + mem_entries[j].size = cpu_to_le64(entry_size); > + > + parcel->n_mem_entries = j + 1; > + parcel->mem_entries = kmemdup(mem_entries, sizeof(*mem_entries) * parcel->n_mem_entries, > + GFP_KERNEL); > + kfree(mem_entries); > + if (!parcel->mem_entries) { > + ret = -ENOMEM; > + goto reclaim; > + } > + > + mutex_unlock(&ghvm->mm_lock); > + return 0; > +reclaim: > + gh_vm_mem_reclaim(ghvm, mapping); > +free_mapping: > + kfree(mapping); > + mutex_unlock(&ghvm->mm_lock); > + return ret; > +} > + > +int gh_vm_mem_free(struct gh_vm *ghvm, u32 label) > +{ > + struct gh_vm_mem *mapping; > + int ret; > + > + ret = mutex_lock_interruptible(&ghvm->mm_lock); > + if (ret) > + return ret; > + > + mapping = __gh_vm_mem_find(ghvm, label); > + if (!mapping) > + goto out; > + > + gh_vm_mem_reclaim(ghvm, mapping); > + kfree(mapping); > +out: > + mutex_unlock(&ghvm->mm_lock); > + return ret; > +} > diff --git a/include/uapi/linux/gunyah.h b/include/uapi/linux/gunyah.h > index 10ba32d2b0a6..d85d12119a48 100644 > --- a/include/uapi/linux/gunyah.h > +++ b/include/uapi/linux/gunyah.h > @@ -20,4 +20,37 @@ > */ > #define GH_CREATE_VM _IO(GH_IOCTL_TYPE, 0x0) /* Returns a Gunyah VM fd */ > > +/* > + * ioctls for VM fds > + */ > + > +/** > + * struct gh_userspace_memory_region - Userspace memory descripion for GH_VM_SET_USER_MEM_REGION nit: s/descripion/description > + * @label: Unique identifer to the region. nit: s/identifer/identifier > + * @flags: Flags for memory parcel behavior > + * @guest_phys_addr: Location of the memory region in guest's memory space (page-aligned) > + * @memory_size: Size of the region (page-aligned) > + * @userspace_addr: Location of the memory region in caller (userspace)'s memory > + * > + * See Documentation/virt/gunyah/vm-manager.rst for further details. > + */ > +struct gh_userspace_memory_region { > + __u32 label; > +#define GH_MEM_ALLOW_READ (1UL << 0) > +#define GH_MEM_ALLOW_WRITE (1UL << 1) > +#define GH_MEM_ALLOW_EXEC (1UL << 2) > +/* > + * The guest will be lent the memory instead of shared. > + * In other words, the guest has exclusive access to the memory region and the host loses access. > + */ > +#define GH_MEM_LENT (1UL << 3) > + __u32 flags; > + __u64 guest_phys_addr; > + __u64 memory_size; > + __u64 userspace_addr; > +}; > + > +#define GH_VM_SET_USER_MEM_REGION _IOW(GH_IOCTL_TYPE, 0x1, \ > + struct gh_userspace_memory_region) > + > #endif > -- > 2.39.1 > > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@xxxxxxxxxxxxxxxxxxx > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel