Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes

On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> Hello,
> 
> I've recently gotten annoyed with the current reference counting rules that
> exist in the file system arena, specifically this pattern of having 0 referenced
> objects that indicate that they're ready to be reclaimed.
> 
> This pattern consistently bites us in the ass, is error prone, and gives us a
> lot of complicated logic around when an object is actually allowed to be
> touched versus when it is not.
> 
> We do this everywhere, with inodes, dentries, and folios, but I specifically
> went to change inodes recently thinking it would be the easiest, and I've run
> into a few big questions.  Currently I've got about ~30 patches, and that is
> mostly just modifying the existing file systems for a new inode_operation.
> Before I devote more time to this silly path, I figured it'd be good to bring it
> up to the group to get some input on what possible better solutions there would
> be.
> 
> I'll try to make this as easy to follow as possible, but I spent a full day and
> a half writing code and thinking about this and it's kind of complicated.  I'll
> break this up into sections to try and make it easier to digest.
> 
> WHAT DO I WANT
> 
> I want to have refcount 0 == we're freeing the object.  This will give us clear
> "I'm using this object, thus I have a reference count on it" rules, and we can
> (hopefully) eliminate a lot of the complicated freeing logic (I_FREEING |
> I_WILL_FREE).

Yeah, I want to see the I_FREEING and I_WILL_FREE stuff go away. This bit
fiddling and waiting is terribly opaque for anyone who hasn't worked on
this since the dawn of time. So I'm all for it.
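For context, the rule being argued for can be sketched in userspace C. This is
not the kernel API -- names like sketch_iget()/sketch_iput() are invented for
illustration -- but it shows the point: a lookup either takes a reference on a
live object or fails cleanly, with no I_FREEING/I_WILL_FREE-style flag fiddling
and waiting.

```c
/*
 * Userspace sketch of the "refcount 0 == we're freeing the object"
 * rule.  sketch_* names are made up; this is not kernel code.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct sketch_inode {
	atomic_int i_count;	/* 0 means the inode is being freed */
};

/* inc_not_zero: take a reference only while the object is still alive. */
static bool sketch_iget(struct sketch_inode *inode)
{
	int old = atomic_load(&inode->i_count);

	do {
		if (old == 0)
			return false;	/* already on its way out; no waiting */
	} while (!atomic_compare_exchange_weak(&inode->i_count, &old, old + 1));
	return true;
}

/* Drop a reference; whoever hits zero owns the free. */
static bool sketch_iput(struct sketch_inode *inode)
{
	if (atomic_fetch_sub(&inode->i_count, 1) == 1) {
		free(inode);
		return true;	/* freed */
	}
	return false;
}
```

A racing lookup that loses simply fails and retries the cache, instead of
spinning on state bits waiting for eviction to finish.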

> 
> HOW DO I WANT TO DO THIS
> 
> Well obviously we always hold a reference whenever we are using the inode, and
> we hold a reference when it is on a list.  This means the i_io_list holds a
> reference to the inode, and the LRU list holds a reference to the inode.
> 
> This makes LRU handling easier, we just walk the objects and drop our reference
> to the object.  If it was truly the last reference then we free it, otherwise it
> will get added back onto the LRU list when the next guy does an iput().
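If I follow, the lifetime rule reads like this single-threaded userspace sketch
(toy_* names are invented, and the lru_lock that would serialize all of this in
the kernel is elided): the last real iput() hands its reference to the LRU, the
shrinker's walk is just "pop and drop the list's reference", and a survivor gets
re-parked by the next iput().

```c
/*
 * Single-threaded userspace sketch of "the LRU list holds a reference".
 * toy_* names are invented; lru_lock is elided for brevity.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_inode {
	atomic_int i_count;
	bool on_lru;
	struct toy_inode *lru_next;
};

static struct toy_inode *lru_head;

static void lru_add(struct toy_inode *inode)
{
	inode->lru_next = lru_head;
	lru_head = inode;
	inode->on_lru = true;
}

static struct toy_inode *lru_pop(void)
{
	struct toy_inode *inode = lru_head;

	if (inode) {
		lru_head = inode->lru_next;
		inode->on_lru = false;
	}
	return inode;
}

/* Drop a reference outright; whoever hits zero frees the inode. */
static bool toy_iput_final(struct toy_inode *inode)
{
	if (atomic_fetch_sub(&inode->i_count, 1) == 1) {
		free(inode);
		return true;
	}
	return false;
}

/*
 * iput(): if we are the last user and the inode is not cached on the
 * LRU yet, hand our reference over to the list instead of freeing.
 */
static void toy_iput(struct toy_inode *inode)
{
	if (!inode->on_lru && atomic_load(&inode->i_count) == 1) {
		lru_add(inode);	/* the list now owns this reference */
		return;
	}
	toy_iput_final(inode);
}

/* Shrinker: walk the list and drop the list's reference on each inode. */
static void toy_shrink_lru(void)
{
	struct toy_inode *inode;

	while ((inode = lru_pop()))
		toy_iput_final(inode);
}
```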
> 
> POTENTIAL PROBLEM #1
> 
> Now we're actively checking to see if this inode is on the LRU list and
> potentially taking the LRU list lock more often.  I don't think this will be
> much of a problem, as we would check the inode flags before we take the lock,
> so we would only marginally increase the contention on the LRU lock.  We could
> mitigate this
> by doing the LRU list add at lookup time, where we already have to grab some of
> these locks, but I don't want to get into premature optimization territory here.
> I'm just surfacing it as a potential problem.

Yes, ignore it for now.

So I agree that removing the inode cache altogether would be pretty
awesome, and we know that we have support for attempting that from
Linus. But I'm not sure what regression potential that has. There might
just be enough implicit behavior that workloads depend on for this to
bite us in the ass.

But I don't think you need to address this in this series. Your changes
might end up making it easier to experiment with the inode cache removal
though.

> POTENTIAL PROBLEM #2
> 
> We have a fair bit of logic in writeback around when we can just skip writeback,
> which amounts to we're currently doing the final truncate on an inode with
> ->i_nlink set.  This is kind of a big problem actually, as we could no
> potentially end up with a large dirty inode that has an nlink of 0, and no
> current users, but would now be written back because it has a reference on it
> from writeback.  Before we could get into the iput() and clean everything up
> before writeback would occur.  Now writeback would occur, and then we'd clean up
> the inode.

So in the old pattern you'd call iput_final() and then do writeback.
Whereas in the new pattern you'd do writeback before iput_final().
And this is a problem because it potentially delays freeing of the inode
for a long time?

> 
> SOLUTION FOR POTENTIAL PROBLEM #1
> 
> I think we ignore this for now, get the patches written, do some benchmarking
> and see if this actually shows up in benchmarks.  If it does then we come up
> with strategies to resolve this at that point.
> 
> SOLUTION FOR POTENTIAL PROBLEM #2 <--- I would like input here
> 
> My initial thought was to just move the final unlink logic outside of evict, and
> create a new reference count that represents the actual use of the inode.  Then
> when the actual use went to 0 we would do the final unlink, de-coupling the
> cleanup of the on-disk inode (in the case of local file systems) from the
> freeing of the memory.

I really do like active/passive reference counts. I've used that pattern
for mount namespaces, seccomp filters and some other stuff quite
successfully. So I'm somewhat inclined to prefer that solution.

Imho, when active/passive patterns are needed or useful then it's
almost always because the original single reference counting mechanism
was semantically vague because it mixed two different meanings of the
reference count. So switching to an active/passive pattern will end up
clarifying things.
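As a strawman, the split could look like this userspace sketch (ap_* names are
invented and "disk_cleaned" stands in for the final unlink/truncate): every
active reference pins one passive reference, dropping the last active reference
is the safe point for the on-disk cleanup, and the memory lives on until the
last passive reference (LRU, writeback, ...) goes away.

```c
/*
 * Userspace sketch of split active/passive inode reference counts.
 * ap_* names are invented; disk_cleaned stands in for the final
 * unlink/truncate of the on-disk inode.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct ap_inode {
	atomic_int active;	/* real users of the inode */
	atomic_int passive;	/* memory pins: lists, writeback, etc. */
	bool disk_cleaned;	/* did the on-disk cleanup run? */
};

/* Drop a memory pin; the last one frees the in-memory inode. */
static void ap_put_passive(struct ap_inode *inode)
{
	if (atomic_fetch_sub(&inode->passive, 1) == 1)
		free(inode);
}

/* Drop a real user's reference. */
static void ap_put_active(struct ap_inode *inode)
{
	if (atomic_fetch_sub(&inode->active, 1) == 1) {
		/*
		 * Last real user: a known-safe point for the final
		 * unlink/truncate, decoupled from freeing the memory.
		 */
		inode->disk_cleaned = true;
	}
	ap_put_passive(inode);	/* each active ref pinned one passive ref */
}
```

The on-disk cleanup then never runs from some random passive put (shrinker,
writeback completion); it runs exactly when the last real user goes away.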

> This is a nice to have because the other thing that bites us occasionally is an
> iput() in a place where we don't necessarily want to, or where it isn't safe
> to, do the final truncate on the inode.  This would allow us to do the final
> truncate at a time when it is safe to do so.
> 
> However this means adding a different reference count to the inode.  I started
> to do this work, but it runs into some ugliness around ->tmpfile and file
> systems that don't use the normal inode caching things (bcachefs, xfs).  I do
> like this solution, but I'm not sure if it's worth the complexity.
> 
> The other solution here is to just say screw it, we'll just always writeback
> dirty inodes, and if they were unlinked then they get unlinked like always.  I
> think this is also a fine solution, because generally speaking if you've got
> memory pressure on the system and the file is dirty and still open, you'll be
> writing it back normally anyway.  But I don't know how people feel about this.
> 
> CONCLUSION
> 
> I'd love some feedback on my potential problems and solutions, as well as any
> other problems people may see.  If we can get some discussion beforehand I can
> finish up these patches and get some testing in before LSFMMBPF and we can have
> a proper in-person discussion about the realities of the patchset.  Thanks,
> 
> Josef



