On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> ...
> > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > +				    loff_t offset, loff_t len)
> > > +{
> > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > +	struct file *memfd = data->memfd;
> > > +	int ret;
> > > +
> > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > +			return -EINVAL;
> > > +	}
> > > +
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> >
> > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > loff_t. For SNP we've made this change as part of the following patch
> > and it seems to produce the expected behavior:
>
> That's correct. Thanks.
>
> >
> >   https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> >
> > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > +	return ret;
> > > +}
> > > +
> >
> > <snip>
> >
> > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > +			   struct page **pagep, int *order)
> > > +{
> > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > +	struct file *memfd = data->memfd;
> > > +	struct page *page;
> > > +	int ret;
> > > +
> > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> >
> > This will result in KVM allocating pages that userspace hasn't
> > necessarily fallocate()'d. In the case of SNP we need to get the PFN so
> > we can clean up the RMP entries when restrictedmem invalidations are
> > issued for a GFN range.
>
> Yes fallocate() is unnecessary unless someone wants to reserve some
> space (e.g. for determination or performance purpose), this matches its
> semantics perfectly at:
> https://www.man7.org/linux/man-pages/man2/fallocate.2.html
>
> >
> > If the guest supports lazy-acceptance however, these pages may not have
> > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > until the guest actually tries to issue a shared->private for that GFN
> > (to support lazy-pinning), then there may never be a need to allocate
> > pages for these backends.
> >
> > However, the restrictedmem invalidations are for GFN ranges so there's
> > no way to know in advance whether it's been allocated yet or not. The
> > xarray is one option but currently it defaults to 'private' so that
> > doesn't help us here. It might if we introduced an 'uninitialized' state
> > or something along that line instead of just the binary
> > 'shared'/'private' though...
>
> How about if we change the default to 'shared' as we discussed at
> https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?

Need to look at this a bit more, but I think that could work as well.

> >
> > But for now we added a restrictedmem_get_page_noalloc() that uses
> > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > of memory as part of guest shutdown, and a
> > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > default, and we just propagate an error to userspace if they didn't
> > fallocate() in advance?
>
> This (making fallocate() a hard requirement) not only complicates the
> userspace but also forces the lazy-faulting going through a long path of
> exiting to userspace. Unless we don't have other options I would not go
> this way.

Unless I'm missing something, it's already the case that userspace is
responsible for handling all the shared->private transitions in response
to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
places the additional requirement on the VMM that if they *don't*
preallocate, then they'll need to issue the fallocate() prior to issuing
the KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.

QEMU for example already has a separate 'prealloc' option for cases
where they want to prefault all the guest memory, so it makes sense to
continue making that an optional thing with regard to UPM.

-Mike

>
> Chao
> >
> > -Mike
> >
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	*pagep = page;
> > > +	if (order)
> > > +		*order = thp_order(compound_head(page));
> > > +
> > > +	SetPageUptodate(page);
> > > +	unlock_page(page);
> > > +
> > > +	return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > --
> > > 2.25.1
> > >
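
For reference, the loff_t vs. pgoff_t mismatch raised at the top of the
thread comes down to converting a byte range into a page-index range
before handing it to the notifier. The standalone sketch below shows
that arithmetic only. It is not necessarily what the linked commit does;
the names (SKETCH_PAGE_SHIFT, struct pgoff_range, bytes_to_pgoff_range)
are hypothetical, and 4 KiB pages are assumed where the kernel would use
PAGE_SHIFT.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; kernel code would use PAGE_SHIFT/PAGE_SIZE. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Hypothetical half-open range [start, end) in units of pages. */
struct pgoff_range {
	uint64_t start;
	uint64_t end;
};

static inline int sketch_page_aligned(uint64_t v)
{
	return (v & (SKETCH_PAGE_SIZE - 1)) == 0;
}

/*
 * Convert a byte range (what fallocate() receives) into page indices
 * (what a pgoff_t-based invalidation callback would expect). Both
 * arguments must be page aligned, mirroring the PUNCH_HOLE check in
 * the quoted patch.
 */
static inline struct pgoff_range bytes_to_pgoff_range(uint64_t offset,
						      uint64_t len)
{
	struct pgoff_range r;

	assert(sketch_page_aligned(offset) && sketch_page_aligned(len));

	r.start = offset >> SKETCH_PAGE_SHIFT;
	r.end = (offset + len) >> SKETCH_PAGE_SHIFT;
	return r;
}
```

With these assumptions, a byte range starting at 0x2000 with length
0x3000 maps to pages [2, 5). Passing the byte values straight through
would instead be read as pages [0x2000, 0x5000).
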