Re: [LSF/MM/BPF TOPIC] Changing reference counting rules for inodes

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 5 Mar 2025 09:23:10 +1100

On Tue, Mar 04, 2025 at 10:02:57AM -0500, Josef Bacik wrote:
> On Tue, Mar 04, 2025 at 07:04:49PM +1100, Dave Chinner wrote:
> > On Mon, Mar 03, 2025 at 12:00:29PM -0500, Josef Bacik wrote:
> > > CONCLUSION
> > > 
> > > I'd love some feedback on my potential problems and solutions, as well as any
> > > other problems people may see.  If we can get some discussion beforehand I can
> > > finish up these patches and get some testing in before LSFMMBPF and we can have
> > > a proper in-person discussion about the realities of the patchset.  Thanks,
> > 
> > I think there's several overlapping issues here:
> > 
> > 1. everything that stores state on/in the VFS inode needs to take a
> >    reference to the inode.
> > 2. Hacks like "walk the superblock inode list to visit every cached
> >    inode without holding references to them" need to go away
> > 3. inode eviction needs to be reworked to allow decoupled processing
> >    of the inode once all VFS references go away.
> > 4. The inode LRU (i.e. unreferenced inode caching) needs to go away
> > 
> > Anything else I missed? :)
> 
> Nope that's all what I've got on my list.  I want to run my plan for decoupling
> the inode eviction by you before I send my patches out to see if you have a
> better idea.  I know XFS has all this delaying stuff, but a lot of file systems
> don't have that infrastructure, and I don't want to go around adding it to
> everybody, so I still want to have a VFS hook way to do the final truncate.  The
> question is where to put it, and will it at the very least not mess you guys up,
> or in the best case be useful.

What I've been looking at locally if for the iput_final() processing
to be done asynchronously rather than in iput() context.

i.e. once the refcount goes to zero, we do all the cleanup stuff we
can do without blocking, then punt it to an async context to finish
the cleanup.

The problems I've had with this are entirely from the "inodes with
refcount of zero are still visible and accessed by the VFS code",
and I hadn't got to solving that problem. The sb list iter work I
posted a while back (and need to finish off) was the first step I'd
made in that direction.

> We're agreed on the active/passive refcount, so the active refcount will be the
> arbiter of when we can do the final truncate.  My current patchset adds a new
> ->final_unlink() method to the inode_operations, and if it's set we call it when
> the active refcount goes to 0.  Obviously a passive refcount is still held on
> the inode while this operation is occurring.

Yes, I think this allows async processing of the inode after the
active refcount reaches zero and iput_final() is called. The async
processing retains a passive reference all the way through until the
inode is completely destroyed and then freed.

> I just want to spell it out, because some of what you've indicated is you want
> the file system to control when the final unlink happens, even outside of the
> VFS.

What I really want is for all iput_final/evict handling that may
need to block to be punted to an async processing context. That's
what we already do with XFS, and what I'd really like for the VFS to
natively support. i.e. passive reference counts allow decoupling of
inode eviction work from the task context that drops the last active
reference.

We need to do this to allow the nasty mm LRU hacks to be turned into
active reference counts because those hacks require iput() to be
safe to call from any context...

> I'm not sure if you intend to say that it happens without a struct inode,
> or you just mean that we'll delay it for whenever the file system wants to do
> it, and we'll be holding the struct inode for that entire time.

Yes, XFS holds the struct inode via the struct xfs_inode
wrapped around the outside of it. i.e. the VFS inode lifecycle is a
subset of the XFS inode lifecycle. 

i.e. I would expect that a filesystem with such a VFs vs FS
lifecycle setup to maintain a passive struct inode reference count
for it's own inode wrapper structure. This means the object will
remain valid for as long as the filesystem needs it's inode wrapper
structure to live.

The filesystem would add the passive reference at inode allocation.
This passive reference then gets removed when the filesystem is done
tearing down it's internal inode structure and at that point the
whole object can be freed.

Also, we already do a -lot- of work behind the scenes without inode
references in the internal XFS code (e.g. inode writeback does
lockless, cache coherent unreferenced lookups of inodes). Converting
these internal lookups to use passive references would greatly
simplify the internal lookup vs reclaim interactions that we
currently have to handle....

Cheers,

Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx