On Tue, Nov 29, 2022 at 01:06:58PM -0600, Michael Roth wrote:
> On Tue, Nov 29, 2022 at 10:06:15PM +0800, Chao Peng wrote:
> > On Mon, Nov 28, 2022 at 06:37:25PM -0600, Michael Roth wrote:
> > > On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> > ...
> > > > +static long restrictedmem_fallocate(struct file *file, int mode,
> > > > +				    loff_t offset, loff_t len)
> > > > +{
> > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > +	struct file *memfd = data->memfd;
> > > > +	int ret;
> > > > +
> > > > +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> > > > +		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > > +			return -EINVAL;
> > > > +	}
> > > > +
> > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, true);
> > >
> > > The KVM restrictedmem ops seem to expect pgoff_t, but here we pass
> > > loff_t. For SNP we've made this change as part of the following patch
> > > and it seems to produce the expected behavior:
> >
> > That's correct. Thanks.
> >
> > >
> > > https://github.com/mdroth/linux/commit/d669c7d3003ff7a7a47e73e8c3b4eeadbd2c4eb6
> > >
> > > > +	ret = memfd->f_op->fallocate(memfd, mode, offset, len);
> > > > +	restrictedmem_notifier_invalidate(data, offset, offset + len, false);
> > > > +	return ret;
> > > > +}
> > > > +
> > >
> > > <snip>
> > >
> > > > +int restrictedmem_get_page(struct file *file, pgoff_t offset,
> > > > +			   struct page **pagep, int *order)
> > > > +{
> > > > +	struct restrictedmem_data *data = file->f_mapping->private_data;
> > > > +	struct file *memfd = data->memfd;
> > > > +	struct page *page;
> > > > +	int ret;
> > > > +
> > > > +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> > >
> > > This will result in KVM allocating pages that userspace hasn't necessarily
> > > fallocate()'d. In the case of SNP we need to get the PFN so we can clean
> > > up the RMP entries when restrictedmem invalidations are issued for a GFN
> > > range.
> >
> > Yes fallocate() is unnecessary unless someone wants to reserve some
> > space (e.g.
> > for determinism or performance purposes), this matches its
> > semantics perfectly at:
> > https://www.man7.org/linux/man-pages/man2/fallocate.2.html
> >
> > >
> > > If the guest supports lazy-acceptance however, these pages may not have
> > > been faulted in yet, and if the VMM defers actually fallocate()'ing space
> > > until the guest actually tries to issue a shared->private for that GFN
> > > (to support lazy-pinning), then there may never be a need to allocate
> > > pages for these backends.
> > >
> > > However, the restrictedmem invalidations are for GFN ranges so there's
> > > no way to know in advance whether it's been allocated yet or not. The
> > > xarray is one option but currently it defaults to 'private' so that
> > > doesn't help us here. It might if we introduced an 'uninitialized' state
> > > or something along that line instead of just the binary
> > > 'shared'/'private' though...
> >
> > How about if we change the default to 'shared' as we discussed at
> > https://lore.kernel.org/all/Y35gI0L8GMt9+OkK@google.com/?
>
> Need to look at this a bit more, but I think that could work as well.
>
> > >
> > > But for now we added a restrictedmem_get_page_noalloc() that uses
> > > SGP_NONE instead of SGP_WRITE to avoid accidentally allocating a bunch
> > > of memory as part of guest shutdown, and a
> > > kvm_restrictedmem_get_pfn_noalloc() variant to go along with that. But
> > > maybe a boolean param is better? Or maybe SGP_NOALLOC is the better
> > > default, and we just propagate an error to userspace if they didn't
> > > fallocate() in advance?
> >
> > This (making fallocate() a hard requirement) not only complicates the
> > userspace but also forces the lazy-faulting going through a long path of
> > exiting to userspace. Unless we don't have other options I would not go
> > this way.
>
> Unless I'm missing something, it's already the case that userspace is
> responsible for handling all the shared->private transitions in response
> to KVM_EXIT_MEMORY_FAULT or (in our case) KVM_EXIT_VMGEXIT. So it only
> places the additional requirements on the VMM that if they *don't*
> preallocate, then they'll need to issue the fallocate() prior to issuing
> the KVM_MEM_ENCRYPT_REG_REGION ioctl in response to these events.
>
> QEMU for example already has a separate 'prealloc' option for cases
> where they want to prefault all the guest memory, so it makes sense to
> continue making that an optional thing with regard to UPM.

Although I guess what you're suggesting doesn't stop userspace from
deciding whether they want to prefault or not. I know the Google folks
had some concerns over unexpected allocations causing 2x memory usage
though so giving userspace full control of what is/isn't allocated in
the restrictedmem backend seems to make it easier to guard against this,
but I think checking the xarray and defaulting to 'shared' would work
for us if that's the direction we end up going.
-Mike

>
> -Mike
>
> >
> > Chao
> > >
> > > -Mike
> > >
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	*pagep = page;
> > > > +	if (order)
> > > > +		*order = thp_order(compound_head(page));
> > > > +
> > > > +	SetPageUptodate(page);
> > > > +	unlock_page(page);
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(restrictedmem_get_page);
> > > > --
> > > > 2.25.1
> > > >
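
As a side note on the loff_t versus pgoff_t mismatch raised for
restrictedmem_fallocate(): the byte range given to fallocate() has to be
turned into page indices before it reaches a notifier that works in
pgoff_t units. The sketch below is a userspace illustration only and is
not taken from the linked commit. demo_page_aligned(),
demo_bytes_to_pgoff(), struct demo_pgoff_range and the 4 KiB
DEMO_PAGE_SHIFT are hypothetical names and assumptions made for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Kernel code would use PAGE_SHIFT instead. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  ((int64_t)1 << DEMO_PAGE_SHIFT)

/* Hypothetical userspace stand-in for the kernel's PAGE_ALIGNED() check. */
static bool demo_page_aligned(int64_t value)
{
	return (value & (DEMO_PAGE_SIZE - 1)) == 0;
}

/* Hypothetical half-open page-index range [start, end). */
struct demo_pgoff_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Convert a byte range (loff_t-style offset and length) into a
 * page-index range (pgoff_t-style). The caller is expected to have
 * rejected unaligned or negative input already, as the quoted patch
 * does with -EINVAL for FALLOC_FL_PUNCH_HOLE.
 */
static struct demo_pgoff_range demo_bytes_to_pgoff(int64_t offset, int64_t len)
{
	struct demo_pgoff_range range;

	assert(offset >= 0 && len >= 0);
	assert(demo_page_aligned(offset) && demo_page_aligned(len));

	range.start = (uint64_t)offset >> DEMO_PAGE_SHIFT;
	range.end = (uint64_t)(offset + len) >> DEMO_PAGE_SHIFT;
	return range;
}
```

With these assumptions, demo_bytes_to_pgoff(8192, 16384) describes pages
2 through 5, i.e. the half-open index range [2, 6). Whether the
conversion belongs in restrictedmem_fallocate() or in the notifier
callbacks is a design choice this sketch does not settle.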