Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes

Josef Bacik <josef@xxxxxxxxxxxxxx> · Tue, 4 Mar 2025 09:56:14 -0500

On Tue, Mar 04, 2025 at 11:19:34AM +0100, Christian Brauner wrote:
> On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> > Hello,
> > 
> > I've recently gotten annoyed with the current reference counting rules that
> > exist in the file system arena, specifically this pattern of having 0 referenced
> > objects that indicate that they're ready to be reclaimed.
> > 
> > This pattern consistently bites us in the ass, is error prone, gives us a lot of
> > complicated logic around when an object is actually allowed to be touched versus
> > when it is not.
> > 
> > We do this everywhere, with inodes, dentries, and folios, but I specifically
> > went to change inodes recently thinking it would be the easiest, and I've run
> > into a few big questions.  Currently I've got about ~30 patches, and that is
> > mostly just modifying the existing file systems for a new inode_operation.
> > Before I devote more time to this silly path, I figured it'd be good to bring it
> > up to the group to get some input on what possible better solutions there would
> > be.
> > 
> > I'll try to make this as easy to follow as possible, but I spent a full day and
> > a half writing code and thinking about this and it's kind of complicated.  I'll
> > break this up into sections to try and make it easier to digest.
> > 
> > WHAT DO I WANT
> > 
> > I want to have refcount 0 == we're freeing the object.  This will give us clear
> > "I'm using this object, thus I have a reference count on it" rules, and we can
> > (hopefully) eliminate a lot of the complicated freeing logic (I_FREEING |
> > I_WILL_FREE).
> 
> Yeah, I want to see I_FREEING and I_WILL_FREE stuff to go away. This bit
> fiddling and waiting is terribly opaque for anyone who hasn't worked on
> this since the dawn of time. So I'm all for it.
> 
> > 
> > HOW DO I WANT TO DO THIS
> > 
> > Well obviously we keep a reference count always whenever we are using the inode,
> > and we hold a reference when it is on a list.  This means the i_io_list holds a
> > reference to the inode, that means the LRU list holds a reference to the inode.
> > 
> > This makes LRU handling easier, we just walk the objects and drop our reference
> > to the object.  If it was truly the last reference then we free it, otherwise it
> > will get added back onto the LRU list when the next guy does an iput().
> > 
> > POTENTIAL PROBLEM #1
> > 
> > Now we're actively checking to see if this inode is on the LRU list and
> > potentially taking the lru list lock more often.  I don't think this will be the
> > case, as we would check the inode flags before we take the lock, so we would
> > martinally increase the lock contention on the LRU lock.  We could mitigate this
> > by doing the LRU list add at lookup time, where we already have to grab some of
> > these locks, but I don't want to get into premature optimization territory here.
> > I'm just surfacing it as a potential problem.
> 
> Yes, ignore it for now.
> 
> So I agree that if we can try and remove the inode cache altogether that
> would be pretty awesome and we know that we have support for attempting
> that from Linus. But I'm not sure what regression potential that has.
> There might just be enough implicit behavior that workloads depend on
> that will bite us in the ass.
> 
> But I don't think you need to address this in this series. Your changes
> might end up making it easier to experiemnt with the inode cache removal
> though.
> 
> > POTENTIAL PROBLEM #2
> > 
> > We have a fair bit of logic in writeback around when we can just skip writeback,
> > which amounts to we're currently doing the final truncate on an inode with
> > ->i_nlink set.  This is kind of a big problem actually, as we could no
> > potentially end up with a large dirty inode that has an nlink of 0, and no
> > current users, but would now be written back because it has a reference on it
> > from writeback.  Before we could get into the iput() and clean everything up
> > before writeback would occur.  Now writeback would occur, and then we'd clean up
> > the inode.
> 
> So in the old pattern you'd call iput_final() and then do writeback.
> Whereas in the new pattern you'd do writeback before iput_final().
> And this is a problem because it potentially delays freeing of the inode
> for a long time?

Well we don't do the writeback in iput_final() if we've unlinked the inode.

Currently, you can have the following sequence of events

Open file
Write to file
Unlink file
Close file
  iput_final()
    remove the inode from the io_list
    wait on current writeback that may have started before the iput_final()
    truncate the inode
    truncate the page cache

In this case we avoid the writeout completely, because once we've entered
iput_final() and set I_FREEING the writeback threads will skip that inode if
they try to write it back, so we skip a whole writeback cycle.

With my naive implementation of just holding reference counts for everything,
writeback now holds its reference, so what would happen now is

Open file
Write to file
Unlink file
Close file
<some time passes>
Writeback occurs on the file
  iput_final()
    truncate the inode

So now we've written back the inode, and then we truncate it. Is this bad? Not
really, unless you're workload does this and you're on SSD's, now you've got
more write cycles than you had before.  The slightly bigger problem is now
you've got the final iput happening in the writeback threads, which will induce
latency in the whole system.

> 
> > 
> > SOLUTION FOR POTENTIAL PROBLEM #1
> > 
> > I think we ignore this for now, get the patches written, do some benchmarking
> > and see if this actually shows up in benchmarks.  If it does then we come up
> > with strategies to resolve this at that point.
> > 
> > SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here
> > 
> > My initial thought was to just move the final unlink logic outside of evict, and
> > create a new reference count that represents the actual use of the inode.  Then
> > when the actual use went to 0 we would do the final unlink, de-coupling the
> > cleanup of the on-disk inode (in the case of local file systems) from the
> > freeing of the memory.
> 
> I really do like active/passive reference counts. I've used that pattern
> for mount namespaces, seccomp filters and some other stuff quite
> successfully. So I'm somewhat inclined to prefer that solution.
> 
> Imho, when active/reference patterns are needed or useful then it's
> almost always because the original single reference counting mechanism
> was semantically vague because it mixed two different meanings of the
> reference count. So switching to an active/passive pattern will end up
> clarifying things.
> 

Ok cool, that's the path I wandered down and then wanted to make sure everybody
else was cool with it.  I'll finish up these patches and get some testing in and
get them out so we can have something concrete to look at.  I'm limiting my
kernel development time to Fridays so it'll be either end of the week or next
week when the patches show up.  Thanks,

Josef