Re: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup

Ackerley Tng <ackerleytng@xxxxxxxxxx> · Tue, 01 Oct 2024 23:00:38 +0000

Vishal Annapurve <vannapurve@xxxxxxxxxx> writes:

> On Wed, Sep 11, 2024 at 1:44 AM Ackerley Tng <ackerleytng@xxxxxxxxxx> wrote:
>>
>> ...
>> +}
>> +
>> +static void kvm_gmem_evict_inode(struct inode *inode)
>> +{
>> +       u64 flags = (u64)inode->i_private;
>> +
>> +       if (flags & KVM_GUEST_MEMFD_HUGETLB)
>> +               kvm_gmem_hugetlb_teardown(inode);
>> +       else
>> +               truncate_inode_pages_final(inode->i_mapping);
>> +
>> +       clear_inode(inode);
>> +}
>> +
>>  static const struct super_operations kvm_gmem_super_operations = {
>>         .statfs         = simple_statfs,
>> +       .evict_inode    = kvm_gmem_evict_inode,
>
> Ackerley, can we use free_inode[1] callback to free any special
> metadata associated with the inode instead of relying on
> super_operations?
>
> [1] https://elixir.bootlin.com/linux/v6.11/source/include/linux/fs.h#L719
>

.free_inode() is not a direct replacement for .evict_inode().

If the .free_inode() op is NULL, free_inode_nonrcu(inode) handles freeing the
struct inode itself. Hence, the .free_inode() op is meant for freeing the inode
struct.

.free_inode() should undo what .alloc_inode() does.

There's more information about the ops free_inode() here
https://docs.kernel.org/filesystems/porting.html, specifically

| Rules for inode destruction:
|
| + if ->destroy_inode() is non-NULL, it gets called
| + if ->free_inode() is non-NULL, it gets scheduled by call_rcu()
| + combination of NULL ->destroy_inode and NULL ->free_inode is treated as
|   NULL/free_inode_nonrcu, to preserve the compatibility.

The common setup is to have a larger containing struct containing a struct
inode, and the .free_inode() op will then free the larger struct. In our case,
we're not using a containing struct for the metadata, so .free_inode() isn't the
appropriate op.

I think this question might be related to Sean's question at LPC about whether
it is necessary for guest_memfd to have its own mount, as opposed to using the
anon_inode_mnt.

I believe having its own mount is the correct approach, my reasoning is as
follows

1. We want to clean up these inode metadata when the last reference to the inode
   is dropped
2. That means using some op on the iput_final() path.
3. All the ops on the iput_final() path are in struct super_operations, which is
   part of struct super_block
4. struct super_block should be used together with a mount

Hence, I think it is correct to have a guest_memfd mount. I guess it might be
possible to have a static super_block without a mount, but that seems hacky and
brittle, and I'm not aware of any precedent for a static super_block.

Sean, what are your concerns with having a guest_memfd mount?

Comparing the callbacks along the iput_final() path, we have these:

+ .drop_inode() determines whether to evict the inode, so that's not the
  approprate op.
+ .evict_inode() is the current proposal, which is a place where the inode's
  fields are cleaned up. HugeTLB uses this to clean up resv_map, which it also
  stores in inode->i_mapping->i_private_data.
+ .destroy_inode() should clean up inode allocation if inode allocation involves
  a containing struct (like shmem_inode_info). Shmem uses this to clean up a
  struct shared_policy, which we will eventually need to store as well.
+ .free_inode() is the rcu-delayed part that completes inode cleanup.

Using .free_inode() implies using a containing struct to follow the
convention. Between putting metadata in a containing struct and using
inode->i_mapping->i_private_data, I think using inode->i_mapping->i_private_data
is less complex since it avoids needing a custom .alloc_inode() op.

Other than using inode->i_mapping->i_private_data, there's the option of
combining the metadata with guest_memfd flags, and storing everything in
inode->i_private.

Because inode->i_mapping actually points to inode->i_data and i_data is
a part of the inode (not a pointer), .evict_inode() is still the op to
use to clean both inode->i_mapping->i_private_data and inode->i_private.

I think we should stick with storing metadata (faultability xarray and hugetlb
pool reference) in inode->i_mapping->i_private_data because both of these are
properties of the page cache/filemap.

When we need to store a memory policy, we might want to use .destroy_inode() to
align with shmem.

What do you all think?

And there's no way to set inode->free_inode directly and skip copying
from inode->i_sb->s_op. All the code paths going to i_callback() copy
inode->i_sb->s_op->free_inode to inode->free_inode before calling
.free_inode() in i_callback() to complete the inode cleanup.