[LSF/MM/BPF TOPIC] Changing reference counting rules for inodes

Josef Bacik <josef@xxxxxxxxxxxxxx> · Mon, 3 Mar 2025 12:00:29 -0500

Hello,

I've recently gotten annoyed with the current reference counting rules that
exist in the file system arena, specifically this pattern of having 0 referenced
objects that indicate that they're ready to be reclaimed.

This pattern consistently bites us in the ass, is error prone, and gives us a
lot of complicated logic around when an object is actually allowed to be touched
versus when it is not.

We do this everywhere, with inodes, dentries, and folios, but I specifically
went to change inodes recently thinking it would be the easiest, and I've run
into a few big questions.  Currently I've got about 30 patches, and that is
mostly just modifying the existing file systems for a new inode_operation.
Before I devote more time to this silly path, I figured it'd be good to bring it
up to the group to get some input on what possible better solutions there would
be.

I'll try to make this as easy to follow as possible, but I spent a full day and
a half writing code and thinking about this and it's kind of complicated.  I'll
break this up into sections to try and make it easier to digest.

WHAT DO I WANT

I want to have refcount 0 == we're freeing the object.  This will give us clear
"I'm using this object, thus I have a reference count on it" rules, and we can
(hopefully) eliminate a lot of the complicated freeing logic (I_FREEING |
I_WILL_FREE).
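
To make the rule concrete, here is a minimal userspace sketch of "refcount 0 ==
we're freeing the object".  Everything in it (struct toy_inode, toy_get(),
toy_put(), the freed_flag test hook) is made up for illustration and is not the
kernel's struct inode or iget()/iput().  It is single-threaded; real code would
need atomic counters and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical toy object; NOT the kernel's struct inode. */
struct toy_inode {
	int refcount;      /* > 0 while anyone is allowed to touch the object */
	bool *freed_flag;  /* illustration-only hook: set when the object is freed */
};

/* Allocate an object; the creator holds the first reference. */
static struct toy_inode *toy_alloc(bool *freed_flag)
{
	struct toy_inode *inode = malloc(sizeof(*inode));

	if (!inode)
		return NULL;
	inode->refcount = 1;
	inode->freed_flag = freed_flag;
	*freed_flag = false;
	return inode;
}

/* Taking a reference is only legal if the caller already holds one. */
static void toy_get(struct toy_inode *inode)
{
	assert(inode->refcount > 0);
	inode->refcount++;
}

/* Dropping the last reference is the one and only place we free. */
static void toy_put(struct toy_inode *inode)
{
	assert(inode->refcount > 0);
	if (--inode->refcount == 0) {
		*inode->freed_flag = true;
		free(inode);
	}
}
```

The point of the sketch is that there is no state where the count is 0 and the
object is still reachable, so there is nothing like I_FREEING | I_WILL_FREE to
check.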

HOW DO I WANT TO DO THIS

Well obviously we always hold a reference whenever we are using the inode,
and we hold a reference when it is on a list.  This means the i_io_list holds a
reference to the inode, that means the LRU list holds a reference to the inode.

This makes LRU handling easier: we just walk the objects and drop our reference
to the object.  If it was truly the last reference then we free it, otherwise it
will get added back onto the LRU list when the next guy does an iput().
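
A rough sketch of that LRU behaviour, again as a userspace toy.  The names
(toy_lru, toy_iput(), toy_lru_shrink()) and the fixed-size array standing in for
the list are my assumptions for illustration only, and a "freed" flag stands in
for actually releasing memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_LRU_MAX 8

/* Hypothetical toy object; "freed" stands in for releasing the memory. */
struct toy_inode {
	int refcount;
	bool on_lru;
	bool freed;
};

/* Hypothetical LRU: a small array used as a stack instead of a real list. */
struct toy_lru {
	struct toy_inode *slots[TOY_LRU_MAX];
	size_t count;
};

/* Being on the list means the list itself owns one reference. */
static void toy_lru_add(struct toy_lru *lru, struct toy_inode *inode)
{
	if (inode->on_lru || lru->count == TOY_LRU_MAX)
		return;
	inode->refcount++;
	inode->on_lru = true;
	lru->slots[lru->count++] = inode;
}

static void toy_put(struct toy_inode *inode)
{
	assert(inode->refcount > 0);
	if (--inode->refcount == 0)
		inode->freed = true;
}

/* A user drops its reference; the inode goes (back) onto the LRU first,
 * while the caller's reference still pins it. */
static void toy_iput(struct toy_lru *lru, struct toy_inode *inode)
{
	toy_lru_add(lru, inode);
	toy_put(inode);
}

/* Reclaim: walk the list and drop the list's reference on every entry.
 * Returns how many objects that actually freed. */
static size_t toy_lru_shrink(struct toy_lru *lru)
{
	size_t nr_freed = 0;

	while (lru->count > 0) {
		struct toy_inode *inode = lru->slots[--lru->count];

		inode->on_lru = false;
		toy_put(inode);
		if (inode->freed)
			nr_freed++;
	}
	return nr_freed;
}
```

An inode that still has a user survives the walk with its count intact and is
simply off the list until that user calls toy_iput().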

POTENTIAL PROBLEM #1

Now we're actively checking to see if this inode is on the LRU list and
potentially taking the LRU list lock more often.  I don't think this will be a
real problem, as we would check the inode flags before we take the lock, so we
would only marginally increase contention on the LRU lock.  We could mitigate this
by doing the LRU list add at lookup time, where we already have to grab some of
these locks, but I don't want to get into premature optimization territory here.
I'm just surfacing it as a potential problem.
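
For reference, the "check the flags before taking the lock" idea looks roughly
like the sketch below.  TOY_I_ON_LRU, struct toy_lock and
toy_lru_add_if_needed() are hypothetical names, and the lock is a counter that
only records how often it was taken so the effect is visible; a real version
would need a proper spinlock and the right memory ordering for the unlocked
read.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_I_ON_LRU 0x1u  /* hypothetical flag bit, not a real i_state flag */

/* Fake lock that just counts acquisitions; single-threaded illustration. */
struct toy_lock {
	unsigned long acquisitions;
	bool held;
};

static void toy_lock_acquire(struct toy_lock *lock)
{
	assert(!lock->held);
	lock->held = true;
	lock->acquisitions++;
}

static void toy_lock_release(struct toy_lock *lock)
{
	assert(lock->held);
	lock->held = false;
}

struct toy_inode {
	unsigned int flags;
};

static void toy_lru_add_if_needed(struct toy_lock *lru_lock,
				  struct toy_inode *inode)
{
	/* Fast path: already on the LRU, so never touch the lock. */
	if (inode->flags & TOY_I_ON_LRU)
		return;

	toy_lock_acquire(lru_lock);
	/* Recheck under the lock; someone else may have added it meanwhile. */
	if (!(inode->flags & TOY_I_ON_LRU))
		inode->flags |= TOY_I_ON_LRU;  /* list insertion would go here */
	toy_lock_release(lru_lock);
}
```

With this shape the lock is only taken on the first add after the inode leaves
the list, which is why I expect the extra contention to be marginal.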

POTENTIAL PROBLEM #2

We have a fair bit of logic in writeback around when we can just skip writeback,
which amounts to: we're currently doing the final truncate on an inode with
->i_nlink == 0.  This is kind of a big problem actually, as we could now
potentially end up with a large dirty inode that has an nlink of 0, and no
current users, but would now be written back because it has a reference on it
from writeback.  Before we could get into the iput() and clean everything up
before writeback would occur.  Now writeback would occur, and then we'd clean up
the inode.

SOLUTION FOR POTENTIAL PROBLEM #1

I think we ignore this for now, get the patches written, do some benchmarking
and see if this actually shows up in benchmarks.  If it does then we come up
with strategies to resolve this at that point.

SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here

My initial thought was to just move the final unlink logic outside of evict, and
create a new reference count that represents the actual use of the inode.  Then
when the actual use went to 0 we would do the final unlink, de-coupling the
cleanup of the on-disk inode (in the case of local file systems) from the
freeing of the memory.

This is a nice-to-have because the other thing that bites us occasionally is an
iput() in a place where we don't necessarily want to do the final truncate on
the inode, or where it isn't safe to do so.  This would allow us to do the final
truncate at a time when it is safe to do so.

However this means adding a different reference count to the inode.  I started
to do this work, but it runs into some ugliness around ->tmpfile and file
systems that don't use the normal inode caching things (bcachefs, xfs).  I do
like this solution, but I'm not sure if it's worth the complexity.
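
To show what I mean by the second count, here is a toy userspace sketch.  The
field and function names (usecount, toy_use_put(), toy_ref_put(),
toy_writeback()) are invented for illustration, this is not how the patches are
structured, and it ignores locking entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy object with two counts. */
struct toy_inode {
	int refcount;        /* memory lifetime: users AND lists (LRU, writeback) */
	int usecount;        /* hypothetical "actual use" count: real users only */
	unsigned int nlink;
	bool dirty;
	bool disk_cleaned;   /* stands in for the final unlink/truncate on disk */
	bool freed;          /* stands in for freeing the memory */
};

/* Dropping the last memory reference frees the object, nothing else. */
static void toy_ref_put(struct toy_inode *inode)
{
	assert(inode->refcount > 0);
	if (--inode->refcount == 0)
		inode->freed = true;
}

/* A real user takes both counts. */
static void toy_use_get(struct toy_inode *inode)
{
	assert(inode->refcount > 0);
	inode->refcount++;
	inode->usecount++;
}

/* When the last real user goes away on an unlinked inode, do the on-disk
 * cleanup right here, even though lists may still pin the memory. */
static void toy_use_put(struct toy_inode *inode)
{
	assert(inode->usecount > 0);
	if (--inode->usecount == 0 && inode->nlink == 0) {
		inode->dirty = false;        /* truncate discards the dirty data */
		inode->disk_cleaned = true;
	}
	toy_ref_put(inode);
}

/* Writeback holds only a memory reference; returns true if it wrote data. */
static bool toy_writeback(struct toy_inode *inode)
{
	if (!inode->dirty)
		return false;
	inode->dirty = false;
	return true;
}
```

In this model an unlinked dirty inode that is still pinned by the writeback list
gets cleaned up when its last user drops it, so the later writeback pass finds
nothing to write and just drops its reference.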

The other solution here is to just say screw it, we'll just always write back
dirty inodes, and if they were unlinked then they get unlinked like always.  I
think this is also a fine solution, because generally speaking if you've got
memory pressure on the system and the file is dirty and still open, you'll be
writing it back normally anyway.  But I don't know how people feel about this.

CONCLUSION

I'd love some feedback on my potential problems and solutions, as well as any
other problems people may see.  If we can get some discussion beforehand I can
finish up these patches and get some testing in before LSFMMBPF and we can have
a proper in-person discussion about the realities of the patchset.  Thanks,

Josef