On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> Hello,
>
> I've recently gotten annoyed with the current reference counting rules
> that exist in the file system arena, specifically this pattern of having
> 0-referenced objects that indicate that they're ready to be reclaimed.

Which means you are now talking about inode life cycle issues :)

There is some recent relevant discussion here:

https://lore.kernel.org/linux-fsdevel/ZvNWqhnUgqk5BlS4@xxxxxxxxxxxxxxxxxxx/

And here:

https://lore.kernel.org/all/20241002014017.3801899-1-david@xxxxxxxxxxxxx/

> This pattern consistently bites us in the ass, is error prone, and gives
> us a lot of complicated logic around when an object is actually allowed
> to be touched versus when it is not.
>
> We do this everywhere, with inodes, dentries, and folios, but I
> specifically went to change inodes recently thinking it would be the
> easiest, and I've run into a few big questions. Currently I've got ~30
> patches, and that is mostly just modifying the existing file systems for
> a new inode_operation. Before I devote more time to this silly path, I
> figured it'd be good to bring it up to the group to get some input on
> what possible better solutions there would be.

I want to get rid of the inode cache (i.e. the unreferenced inodes on
the LRU) altogether. To do that, everything that needs the inode to
stay in memory (e.g. writeback) needs to hold an active reference to
the inode.

As you've noticed, there are lots of hacks in code that try to hang
subsystem state/objects off the inode without taking an inode
reference.

> I'll try to make this as easy to follow as possible, but I spent a full
> day and a half writing code and thinking about this and it's kind of
> complicated. I'll break this up into sections to try and make it easier
> to digest.
>
> WHAT DO I WANT
>
> I want to have refcount 0 == we're freeing the object. This will give us
> clear "I'm using this object, thus I have a reference count on it" rules,
> and we can (hopefully) eliminate a lot of the complicated freeing logic
> (I_FREEING | I_WILL_FREE).

Yes. IMO, when iput_final() is called, the VFS inode should be kaput,
never to be found again. Nothing in the VFS should be accessing it,
nor should any VFS-level code be allowed to take a new reference to it
at that point in time.

> HOW DO I WANT TO DO THIS
>
> Well, obviously we always keep a reference count whenever we are using
> the inode, and we hold a reference when it is on a list. This means the
> i_io_list holds a reference to the inode, and the LRU list holds a
> reference to the inode.

Add to that list:

- fsnotify objects that are currently swept at superblock teardown via
  an inode cache walk.
- LSM modules (e.g. landlock) that sweep state at superblock teardown
  via an inode cache walk.
- dquots
- writeback
- the mapping tree containing folios (*)
- anything else that stores non-core state or private objects in/on
  the VFS inode.

(*) this is why __inode_add_lru() is such a mess and why I NACKed it
being used in the mm subsystem in the first place. The mm subsystem
uses this to prevent unreferenced inodes that have page cache attached
to them from being added to the LRU. That is, the mm subsystem has
been allowed to prevent unreferenced inodes from being tracked by the
VFS whilst the mm subsystem tracks and ages out mm-owned objects
attached to those inodes.

The whole reason this was done is to prevent the inode cache shrinker
from tossing inodes with cached pages before the workingset aging
mechanism decides the page cache needs to be tossed. But to do this,
games had to be played with the LRU, because the places where the mm
wanted to mark the inode as "can be reclaimed" are places where it is
not safe to call iput()....

IOWs, any "all inode usage needs to be reference counted" change is
going to have to solve this mess in some way....
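To make that last point concrete, here is a minimal sketch (in
kernel-flavoured C) of what "every list holds a reference" looks like
and where the unsafe final put shows up. Every name in it is
hypothetical - this is not the real VFS API, and a plain list stands
in for the per-sb list_lru:

#include <linux/refcount.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* Sketch only: all names hypothetical. */
struct vfs_inode {
	refcount_t		i_refs;	/* 0 == inode is being freed */
	struct list_head	i_lru;
};

static LIST_HEAD(inode_lru);
static DEFINE_SPINLOCK(lru_lock);

/* Hypothetical eviction: final truncate, fs-specific teardown. */
static void inode_final_evict(struct vfs_inode *inode)
{
	/* ... */
}

/* The LRU takes its own reference; the caller already holds one. */
static void inode_lru_add(struct vfs_inode *inode)
{
	refcount_inc(&inode->i_refs);
	spin_lock(&lru_lock);
	list_add_tail(&inode->i_lru, &inode_lru);
	spin_unlock(&lru_lock);
}

/*
 * Dropping the LRU's reference may be the final put, which runs
 * eviction. That is exactly what cannot happen in the mm reclaim
 * contexts where the page cache currently decides an unreferenced
 * inode has become reclaimable.
 */
static void inode_lru_del(struct vfs_inode *inode)
{
	spin_lock(&lru_lock);
	list_del_init(&inode->i_lru);
	spin_unlock(&lru_lock);

	if (refcount_dec_and_test(&inode->i_refs))
		inode_final_evict(inode);
}

Under this scheme, the places where the mm currently flips an inode to
"reclaimable" would have to take and drop real references, or defer
the final drop to a context where eviction is safe to run.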
> This makes LRU handling easier: we just walk the objects and drop our
> reference to each object. If it was truly the last reference then we
> free it, otherwise it will get added back onto the LRU list when the
> next guy does an iput().
>
> POTENTIAL PROBLEM #1
>
> Now we're actively checking to see if this inode is on the LRU list and
> potentially taking the LRU list lock more often. I don't think this
> will be a problem in practice, as we would check the inode flags before
> we take the lock, so we would only marginally increase the contention
> on the LRU lock. We could mitigate this by doing the LRU list add at
> lookup time, where we already have to grab some of these locks, but I
> don't want to get into premature optimization territory here. I'm just
> surfacing it as a potential problem.

The inode cache (and the LRU) needs to go away. Reference count
everything so page cache residency doesn't trigger inode reclaim, and
otherwise the dentry cache LRU keeps the inodes of the hot working set
in cache. There is no reason for maintaining a separately aged inode
cache these days....

> POTENTIAL PROBLEM #2
>
> We have a fair bit of logic in writeback around when we can just skip
> writeback, which amounts to: we're currently doing the final truncate
> on an inode with ->i_nlink == 0. This is kind of a big problem
> actually, as we could now potentially end up with a large dirty inode
> that has an nlink of 0 and no current users, but which would now be
> written back because writeback holds a reference on it. Before, we
> could get into iput() and clean everything up before writeback would
> occur. Now writeback would occur, and then we'd clean up the inode.

Yup, an active/passive reference pattern.

> SOLUTION FOR POTENTIAL PROBLEM #1
>
> I think we ignore this for now, get the patches written, do some
> benchmarking and see if this actually shows up in benchmarks. If it
> does, then we come up with strategies to resolve it at that point.

Oh, I pretty much guarantee moving away from lazy LRU management will
show up in any benchmark that stresses cold cache behaviour :/

> SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here
>
> My initial thought was to just move the final unlink logic outside of
> evict, and

That's what XFS does - we don't actually process the final unlink
stage until after ->destroy_inode has been called. We defer that
processing to background inodegc workers (i.e. it is processed after
unlink() returns to userspace), and it all happens without the VFS
knowing anything about it.

> create a new reference count that represents the actual use of the
> inode. Then when the actual use went to 0 we would do the final unlink,
> de-coupling the cleanup of the on-disk inode (in the case of local file
> systems) from the freeing of the memory.

Can we defer the final filesystem unlink processing like XFS does?
The filesystem effectively controls freeing of the inode object, so it
can live as long as the filesystem wants once the VFS is actively done
with it.
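For reference, that deferral pattern looks roughly like the sketch
below. All names (fs_inode, fs_destroy_inode(), inodegc_*) are
invented, and the real XFS inodegc implementation uses per-cpu queues
with throttling - this only shows the shape of it:

#include <linux/workqueue.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct fs_inode {
	struct list_head	gc_list;
	/* on-disk inode state, etc. */
};

static LIST_HEAD(inodegc_queue);
static DEFINE_SPINLOCK(inodegc_lock);

/* Background worker: do the final truncate/on-disk free here. */
static void inodegc_worker(struct work_struct *work)
{
	struct fs_inode *ip, *next;
	LIST_HEAD(todo);

	spin_lock(&inodegc_lock);
	list_splice_init(&inodegc_queue, &todo);
	spin_unlock(&inodegc_lock);

	list_for_each_entry_safe(ip, next, &todo, gc_list) {
		list_del(&ip->gc_list);
		/* free the on-disk blocks and inode, then the memory */
	}
}

static DECLARE_WORK(inodegc_work, inodegc_worker);

/*
 * Called when the VFS is completely done with the inode (simplified:
 * the real ->destroy_inode callback takes a struct inode *). By now
 * unlink() has long since returned to userspace; the actual cleanup
 * happens later, in a context the filesystem chooses.
 */
static void fs_destroy_inode(struct fs_inode *ip)
{
	spin_lock(&inodegc_lock);
	list_add_tail(&ip->gc_list, &inodegc_queue);
	spin_unlock(&inodegc_lock);
	schedule_work(&inodegc_work);
}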
> This is a nice-to-have because the other thing that bites us
> occasionally is an iput() in a place where we don't necessarily want
> to be / where it isn't safe to do the final truncate on the inode.
> This would allow us to do the final truncate at a time when it is safe
> to do so.

Yes, this is one of the reasons why XFS has traditionally done this
final unlink stage outside the VFS inode life cycle. We also do
cleanup of speculative preallocation outside the VFS inode life cycle
via the same code paths.

> However this means adding a different reference count to the inode. I
> started to do this work, but it runs into some ugliness around
> ->tmpfile and file systems that don't use the normal inode caching
> things (bcachefs, xfs). I do like this solution, but I'm not sure if
> it's worth the complexity.

We should be able to make it work just fine with XFS if it allows
decoupled cleanup. i.e. it is run via a new cleanup method that would
be called instead of ->destroy_inode, leaving the filesystem to call
destroy_inode() when it is done?

> The other solution here is to just say screw it, we'll always write
> back dirty inodes, and if they were unlinked then they get unlinked
> like always. I think this is also a fine solution, because generally
> speaking, if you've got memory pressure on the system and the file is
> dirty and still open, you'll be writing it back normally anyway. But I
> don't know how people feel about this.
>
> CONCLUSION
>
> I'd love some feedback on my potential problems and solutions, as well
> as any other problems people may see. If we can get some discussion
> beforehand, I can finish up these patches and get some testing in
> before LSFMMBPF, and we can have a proper in-person discussion about
> the realities of the patchset. Thanks,

I think there are several overlapping issues here:

1. Everything that stores state on/in the VFS inode needs to take a
   reference to the inode.
2. Hacks like "walk the superblock inode list to visit every cached
   inode without holding references to them" need to go away.
3. Inode eviction needs to be reworked to allow decoupled processing
   of the inode once all VFS references go away.
4. The inode LRU (i.e. unreferenced inode caching) needs to go away.

Anything else I missed? :)

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx