Hi,

On Mon, Sep 26, 2022 at 3:28 PM Chao Peng <chao.p.peng@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > Regarding pKVM's use case, with the shim approach I believe this can be
> > > done by allowing userspace to mmap() the "hidden" memfd, but with a ton
> > > of restrictions piled on top.
> > >
> > > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > > could tightly control usage without taking on too much complexity in
> > > the kernel, but working through things, routing the behavior through
> > > the shim itself might not be all that horrific.
> > >
> > > IIRC, we discarded the idea of allowing userspace to map the "private"
> > > fd because things got too complex, but with the shim it doesn't seem
> > > _that_ bad.
> > >
> > > E.g. on the memfd side:
> > >
> > >   1. The entire memfd must be mapped, and at most one mapping is
> > >      allowed, i.e. mapping is all or nothing.
> > >
> > >   2. Acquiring a reference via get_pfn() is disallowed if there's a
> > >      mapping for the restricted memfd.
> > >
> > >   3. Add notifier hooks to allow downstream users to further restrict
> > >      things.
> > >
> > >   4. Disallow splitting VMAs, e.g. to force userspace to munmap()
> > >      everything in one shot.
> > >
> > >   5. Require that there are no outstanding references at munmap().  Or
> > >      if this can't be guaranteed by userspace, maybe add some way for
> > >      userspace to wait until it's ok to convert to private?  E.g. so
> > >      that get_pfn() doesn't need to do an expensive check every time.
> > >
> > > static int memfd_restricted_mmap(struct file *file, struct vm_area_struct *vma)
> > > {
> > > 	if (vma->vm_pgoff)
> > > 		return -EINVAL;
> > >
> > > 	if ((vma->vm_end - vma->vm_start) != <file size>)
> > > 		return -EINVAL;
> > >
> > > 	mutex_lock(&data->lock);
> > >
> > > 	if (data->has_mapping) {
> > > 		r = -EINVAL;
> > > 		goto err;
> > > 	}
> > > 	list_for_each_entry(notifier, &data->notifiers, list) {
> > > 		r = notifier->ops->mmap_start(notifier, ...);
> > > 		if (r)
> > > 			goto abort;
> > > 	}
> > >
> > > 	list_for_each_entry(notifier, &data->notifiers, list)
> > > 		notifier->ops->mmap_end(notifier, ...);
> > >
> > > 	mutex_unlock(&data->lock);
> > > 	return 0;
> > >
> > > abort:
> > > 	list_for_each_entry_continue_reverse(notifier, &data->notifiers, list)
> > > 		notifier->ops->mmap_abort(notifier, ...);
> > > err:
> > > 	mutex_unlock(&data->lock);
> > > 	return r;
> > > }
> > >
> > > static void memfd_restricted_close(struct vm_area_struct *vma)
> > > {
> > > 	mutex_lock(...);
> > >
> > > 	/*
> > > 	 * Destroy the memfd and disable all future accesses if there are
> > > 	 * outstanding refcounts (or other unsatisfied restrictions?).
> > > 	 */
> > > 	if (<outstanding references> || ???)
> > > 		memfd_restricted_destroy(...);
> > > 	else
> > > 		data->has_mapping = false;
> > >
> > > 	mutex_unlock(...);
> > > }
> > >
> > > static int memfd_restricted_may_split(struct vm_area_struct *area, unsigned long addr)
> > > {
> > > 	return -EINVAL;
> > > }
> > >
> > > static int memfd_restricted_mapping_mremap(struct vm_area_struct *new_vma)
> > > {
> > > 	return -EINVAL;
> > > }
> > >
> > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > >
> > >   1. Not be supported for TDX or SEV-SNP because they don't allow adding
> > >      non-zero memory into the guest (after pre-boot phase).
> > >
> > >   2. Be mutually exclusive with shared<=>private conversions, and be
> > >      allowed if and only if the entire gfn range of the associated
> > >      memslot is shared.
> >
> > In general I think that this would work with pKVM. However, limiting
> > private<->shared conversions to the granularity of a whole memslot
> > might be difficult to handle in pKVM, since the guest doesn't have the
> > concept of memslots. For example, in pKVM right now, when a guest
> > shares back its restricted DMA pool with the host it does so at page
> > granularity. pKVM would also need a way to make an fd accessible again
> > when shared back, which I think isn't possible with this patch.
>
> But does pKVM really want to mmap/munmap a new region at page
> granularity? That could cause VMA fragmentation if conversions are
> frequent, as I see it. Even with a KVM ioctl for mapping, as mentioned
> below, I think there would be the same issue.

pKVM doesn't really need to unmap the memory. What is really important
is that the memory is not GUP'able. If private memory is mapped and then
accessed by a misbehaving/malicious process, a fault is reinjected into
the misbehaving process.
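
Just to make "not GUP'able" concrete, something along the lines of the
sketch below is what I have in mind, building on Sean's
memfd_restricted_mmap() above. The restricted_data struct and the
restricted_data_is_private()/restricted_data_get_pfn() helpers are
made-up placeholders for whatever the shim tracks internally, not
existing code:

static vm_fault_t memfd_restricted_fault(struct vm_fault *vmf)
{
	struct restricted_data *data = vmf->vma->vm_file->private_data;
	unsigned long pfn;

	/* A page that is currently private must never be reachable. */
	if (restricted_data_is_private(data, vmf->pgoff))
		return VM_FAULT_SIGBUS;

	/* Raw PFN mapping: no struct page reference is handed out. */
	pfn = restricted_data_get_pfn(data, vmf->pgoff);
	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}

static const struct vm_operations_struct memfd_restricted_vm_ops = {
	.fault		= memfd_restricted_fault,
	.may_split	= memfd_restricted_may_split,
	.mremap		= memfd_restricted_mapping_mremap,
	.close		= memfd_restricted_close,
};

	/* In memfd_restricted_mmap(), after the checks above: */
	vma->vm_flags |= VM_PFNMAP | VM_IO | VM_DONTEXPAND | VM_DONTDUMP;
	vma->vm_ops = &memfd_restricted_vm_ops;

With VM_PFNMAP | VM_IO set, get_user_pages() refuses the VMA outright,
and an access to a page that has been converted to private gets SIGBUS,
i.e. the fault lands on the misbehaving process rather than anything
being handed a pinned reference to private memory.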

Cheers,
/fuad

> >
> > You were initially considering a KVM ioctl for mapping, which might be
> > better suited for this since KVM knows which pages are shared and
> > which ones are private. So routing things through KVM might simplify
> > things and allow it to enforce all the necessary restrictions (e.g.,
> > private memory cannot be mapped). What do you think?
> >
> > Thanks,
> > /fuad
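
P.S. For reference, below is roughly the kind of ioctl I was thinking of
in the quoted paragraph above. The ioctl, the struct, and the
kvm_range_is_shared()/restrictedmem_map_range() helpers are all invented
for illustration, nothing that exists today; the point is only that KVM
can perform the shared/private check that the shim on its own cannot:

/* Hypothetical vm ioctl argument: map a guest range into the VMM. */
struct kvm_shared_mem_map {
	__u64 gpa;
	__u64 size;
	__u64 user_addr;
};

static int kvm_vm_ioctl_map_shared_mem(struct kvm *kvm,
				       struct kvm_shared_mem_map *map)
{
	gfn_t start = gpa_to_gfn(map->gpa);
	gfn_t end = gpa_to_gfn(map->gpa + map->size);
	int r;

	mutex_lock(&kvm->slots_lock);

	/* KVM knows which gfns are shared; refuse anything private. */
	if (!kvm_range_is_shared(kvm, start, end)) {
		r = -EPERM;
		goto out;
	}

	/* Hand off to the restrictedmem side to install the mapping. */
	r = restrictedmem_map_range(kvm, start, end, map->user_addr);
out:
	mutex_unlock(&kvm->slots_lock);
	return r;
}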