Vishal Annapurve <vannapurve@xxxxxxxxxx> writes: > On Wed, Sep 11, 2024 at 1:44 AM Ackerley Tng <ackerleytng@xxxxxxxxxx> wrote: >> >> ... >> +} >> + >> +static void kvm_gmem_evict_inode(struct inode *inode) >> +{ >> + u64 flags = (u64)inode->i_private; >> + >> + if (flags & KVM_GUEST_MEMFD_HUGETLB) >> + kvm_gmem_hugetlb_teardown(inode); >> + else >> + truncate_inode_pages_final(inode->i_mapping); >> + >> + clear_inode(inode); >> +} >> + >> static const struct super_operations kvm_gmem_super_operations = { >> .statfs = simple_statfs, >> + .evict_inode = kvm_gmem_evict_inode, > > Ackerley, can we use free_inode[1] callback to free any special > metadata associated with the inode instead of relying on > super_operations? > > [1] https://elixir.bootlin.com/linux/v6.11/source/include/linux/fs.h#L719 > .free_inode() is not a direct replacement for .evict_inode(). If the .free_inode() op is NULL, free_inode_nonrcu(inode) handles freeing the struct inode itself. Hence, the .free_inode() op is meant for freeing the inode struct. .free_inode() should undo what .alloc_inode() does. There's more information about the ops free_inode() here https://docs.kernel.org/filesystems/porting.html, specifically | Rules for inode destruction: | | + if ->destroy_inode() is non-NULL, it gets called | + if ->free_inode() is non-NULL, it gets scheduled by call_rcu() | + combination of NULL ->destroy_inode and NULL ->free_inode is treated as | NULL/free_inode_nonrcu, to preserve the compatibility. The common setup is to have a larger containing struct containing a struct inode, and the .free_inode() op will then free the larger struct. In our case, we're not using a containing struct for the metadata, so .free_inode() isn't the appropriate op. I think this question might be related to Sean's question at LPC about whether it is necessary for guest_memfd to have its own mount, as opposed to using the anon_inode_mnt. I believe having its own mount is the correct approach, my reasoning is as follows 1. We want to clean up these inode metadata when the last reference to the inode is dropped 2. That means using some op on the iput_final() path. 3. All the ops on the iput_final() path are in struct super_operations, which is part of struct super_block 4. struct super_block should be used together with a mount Hence, I think it is correct to have a guest_memfd mount. I guess it might be possible to have a static super_block without a mount, but that seems hacky and brittle, and I'm not aware of any precedent for a static super_block. Sean, what are your concerns with having a guest_memfd mount? Comparing the callbacks along the iput_final() path, we have these: + .drop_inode() determines whether to evict the inode, so that's not the approprate op. + .evict_inode() is the current proposal, which is a place where the inode's fields are cleaned up. HugeTLB uses this to clean up resv_map, which it also stores in inode->i_mapping->i_private_data. + .destroy_inode() should clean up inode allocation if inode allocation involves a containing struct (like shmem_inode_info). Shmem uses this to clean up a struct shared_policy, which we will eventually need to store as well. + .free_inode() is the rcu-delayed part that completes inode cleanup. Using .free_inode() implies using a containing struct to follow the convention. Between putting metadata in a containing struct and using inode->i_mapping->i_private_data, I think using inode->i_mapping->i_private_data is less complex since it avoids needing a custom .alloc_inode() op. Other than using inode->i_mapping->i_private_data, there's the option of combining the metadata with guest_memfd flags, and storing everything in inode->i_private. Because inode->i_mapping actually points to inode->i_data and i_data is a part of the inode (not a pointer), .evict_inode() is still the op to use to clean both inode->i_mapping->i_private_data and inode->i_private. I think we should stick with storing metadata (faultability xarray and hugetlb pool reference) in inode->i_mapping->i_private_data because both of these are properties of the page cache/filemap. When we need to store a memory policy, we might want to use .destroy_inode() to align with shmem. What do you all think? And there's no way to set inode->free_inode directly and skip copying from inode->i_sb->s_op. All the code paths going to i_callback() copy inode->i_sb->s_op->free_inode to inode->free_inode before calling .free_inode() in i_callback() to complete the inode cleanup.