On Wed, Sep 15, 2021 at 05:11:47PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote:
> > On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote:
> > > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
> > > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?
> > > > >
> > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will
> > > > > look closer into locking next.
> > > >
> > > > We can decisively eliminate this sort of failure by making the switch
> > > > happen at open time instead of after. For a memfd-like API, this would
> > > > be straightforward. For a filesystem, it would take a bit more thought.
> > >
> > > I think it should work fine as long as we check seals after i_size in
> > > the read path. See the comment in shmem_file_read_iter().
> > >
> > > Below is the updated version. I think it should be good enough to start
> > > integrating with KVM.
> > >
> > > I also attach a test case that consists of a kernel patch and a
> > > userspace program. It demonstrates how it can be integrated into KVM
> > > code.
> > >
> > > One caveat I noticed is that guest_ops::invalidate_page_range() can be
> > > called after the owner (struct kvm) has been freed. It happens because
> > > the memfd can outlive KVM. So the callback has to check whether such an
> > > owner exists, then check that there's a memslot with such an inode.
> >
> > Would introducing memfd_unregister_guest() fix this?
>
> I considered this, but it gets complex quickly.
>
> At what point does it get called? On KVM memslot destroy?

I meant when the VM gets destroyed.

>
> What if multiple KVM slots share the same memfd? Add a refcount into the
> memfd for how many times the owner registered the memfd?
>
> It would leave us in a strange state: the memfd refcounts owners
> (struct kvm) and the KVM memslot pins the struct file. A weird refcount
> exchange program.
>
> I hate it.

But yes, I agree things would get complex in practice.

> > > I guess it should be okay: we have vm_list we can check the owner
> > > against. We may consider replacing vm_list with something more scalable
> > > if the number of VMs gets too high.
> > >
> > > Any comments?
> > >
> > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > > index 4f1600413f91..3005e233140a 100644
> > > --- a/include/linux/memfd.h
> > > +++ b/include/linux/memfd.h
> > > @@ -4,13 +4,34 @@
> > >
> > >  #include <linux/file.h>
> > >
> > > +struct guest_ops {
> > > +	void (*invalidate_page_range)(struct inode *inode, void *owner,
> > > +				      pgoff_t start, pgoff_t end);
> > > +};
> >
> > I can see two scenarios for invalidating page(s): punching a hole or
> > ftruncating to 0. In either case KVM should already have been called
> > with the necessary information from userspace, via the memory slot punch
> > hole or memory slot delete path, so I'm wondering whether this callback
> > is really needed.
>
> So what do you propose? Forbid truncate/punch from userspace and make KVM
> handle punch hole/truncate from within the kernel? I think that's a
> layering violation.

As far as I understand it, the punch hole/truncate flow in this design
takes two steps in userspace:
  1. punch a hole in / delete the KVM memory slot, and then
  2. punch a hole in / truncate the memory backing store fd.
In principle we can do whatever invalidation is needed in either step. If
we can do the invalidation in step 1, then we don't need to bother with
this callback. That is what I meant, but I agree the current callback can
also work.
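
For step 1 + step 2, the userspace side I have in mind would look roughly
like the sketch below. It is only meant to illustrate the ordering; error
handling and the partial-punch case (where the slot would have to be split
and re-added) are left out, and punch_guest_memory() is just a name made up
for the example.

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/falloc.h>
	#include <linux/kvm.h>

	/* Drop a range of guest memory backed by the sealed memfd. */
	static int punch_guest_memory(int vm_fd, int memfd, __u32 slot,
				      __u64 gpa, loff_t offset, loff_t len)
	{
		struct kvm_userspace_memory_region region = {
			.slot = slot,
			.guest_phys_addr = gpa,
			.memory_size = 0,	/* size 0 deletes the slot */
		};

		/* Step 1: delete the KVM memory slot so KVM drops its mappings. */
		if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
			return -1;

		/* Step 2: punch a hole in the backing memfd to free the pages. */
		return fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 offset, len);
	}
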
> > > +
> > > +struct guest_mem_ops {
> > > +	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset);
> > > +	void (*put_unlock_pfn)(unsigned long pfn);
> >
> > Same as above, I'm not clear on when put_unlock_pfn() would be called.
> > I'm thinking the page can be put and unlocked when userspace punches a
> > hole or ftruncates to 0 on the fd.
>
> No. put_unlock_pfn() has to be called after the pfn is in the SEPT. This
> way we close the race between SEPT population and truncate/punch.
> get_lock_pfn() would stop truncate until put_unlock_pfn() is called.

Okay, makes sense.

Thanks,
Chao
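
PS: to double-check that I read the locking contract right, I would expect
the KVM fault path to use the two hooks roughly as below. Only
get_lock_pfn()/put_unlock_pfn() come from your patch; the surrounding names
(kvm_faultin_private_pfn, kvm_install_private_spte) are made up for the
illustration, and error handling is omitted.

	/* Sketch only: hold the pfn locked across SEPT population. */
	static int kvm_faultin_private_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
					   struct inode *inode, pgoff_t pgoff,
					   const struct guest_mem_ops *ops)
	{
		unsigned long pfn;
		int ret;

		/* Blocks truncate/punch on the memfd until put_unlock_pfn(). */
		pfn = ops->get_lock_pfn(inode, pgoff);

		/* Install the private mapping while the pfn is still locked. */
		ret = kvm_install_private_spte(vcpu, gfn, pfn);

		/* Only now can truncate/punch on this range make progress. */
		ops->put_unlock_pfn(pfn);

		return ret;
	}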