Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()

On Wed 09-10-24 10:44:12, Dave Chinner wrote:
> On Tue, Oct 08, 2024 at 01:23:44PM +0200, Jan Kara wrote:
> > On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> > > On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@xxxxxxx> wrote:
> > > > > >
> > > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > > > Landlock's hook_sb_delete() into a single iteration relatively easily. But
> > > > > > I'd wait with that conversion until this series lands.
> > > > >
> > > > > Honza, I looked at this a bit more, particularly with an eye to "what
> > > > > happens if we just end up making the inode lifetimes subject to the
> > > > > dentry lifetimes" as suggested by Dave elsewhere.
> > > >
> > > > ....
> > > >
> > > > > which makes the fsnotify_inode_delete() happen when the inode is
> > > > > removed from the dentry.
> > > >
> > > > There may be other inode references being held that make
> > > > the inode live longer than the dentry cache. When should the
> > > > fsnotify marks be removed from the inode in that case? Do they need
> > > > to remain until, e.g., writeback completes?
> > > >
> > > 
> > > fsnotify inode marks remain until explicitly removed or until sb
> > > is unmounted (*), so other inode references are irrelevant to
> > > inode mark removal.
> > > 
> > > (*) fanotify has "evictable" inode marks, which do not hold an inode
> > > reference and go away on inode eviction, but those mark evictions
> > > do not generate any event (i.e. there is no FAN_UNMOUNT).
> > 
> > Yes. Amir beat me to the response so let me just add that the FS_UNMOUNT
> > event is for inotify, which guarantees that if you place a mark on some
> > inode you either get an event about somebody unlinking the inode
> > (e.g. IN_DELETE_SELF) or an event about the filesystem being unmounted
> > (IN_UNMOUNT). I also don't see how we would maintain this behavior with
> > what Linus proposes.
> 
> Thanks. I didn't respond last night when I read Amir's description
> because I wanted to think it over. Knowing where the unmount event
> requirement comes from certainly helps.
> 
> I am probably missing something important, but it really seems to me
> that the object reference counting model is back to front.
> Currently the mark is attached to the inode and then the inode is
> pinned by a reference count to make the mark attached to the inode
> persistent until unmount. This then requires the inodes to be swept
> at unmount because fsnotify has effectively leaked them: it isn't
> tracking such inodes itself.
> 
> [ Keep in mind that I'm not saying this was a bad or wrong thing to
> do because the s_inodes list was there to be able to do this sort of
> lazy cleanup. But now that we want to remove the s_inodes list if at
> all possible, it is a problem we need to solve differently. ]

Yes, agreed.

> AFAICT, inotify does not appear to require the inode to send events
> - it only requires access to the inode mark itself. Hence it does
> not need the inode in cache to generate IN_UNMOUNT events, it just
> needs the mark itself to be findable at unmount. Do any of the
> other backends that require unmount notifications also need
> special access to the inode itself?

No, I don't think unmount notification requires looking at the inode and it
is an inotify-specific thing as Amir wrote. We do require inode access when
generating fanotify events (to open an fd for the object where the event
happened) but that gets handled separately: we create a struct path when the
event happens and use it for dentry_open() later when reporting to
userspace, so the event carries its own set of dentry + mnt references
while it is waiting in the queue.
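
For illustration, a rough sketch of that scheme (not the actual fanotify
code - the struct and helper names are invented, but path_get(),
dentry_open() and current_cred() are the real kernel primitives):

struct queued_event_sketch {
	struct list_head list;
	struct path path;		/* pins dentry + mnt while queued */
};

static void sketch_queue_event(struct queued_event_sketch *ev,
			       const struct path *where)
{
	ev->path = *where;
	path_get(&ev->path);		/* take dentry + mnt references */
}

static struct file *sketch_report_event(struct queued_event_sketch *ev)
{
	/*
	 * Opening via the pinned path when userspace reads the event does
	 * not depend on fsnotify holding its own inode reference.
	 */
	return dentry_open(&ev->path, O_RDONLY | O_CLOEXEC, current_cred());
}
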

> If not, and the fsnotify sb info is tracking these persistent marks,
> then we don't need to iterate inodes at unmount. This means we don't
> need to pin inodes when they have marks attached, and so the
> dependency on the s_inodes list goes away.
> 
> With this inverted model, we need the first fsnotify event callout
> after the inode is instantiated to look for a persistent mark for
> the inode. We know how to do this efficiently - it's exactly the
> same caching model we use for ACLs. On the first lookup, we check
> the inode for ACL data and set the ACL pointer so that it records
> both that a lookup has been done and whether any ACLs are
> associated with the inode.

Yes, I agree such a scheme should be possible, although a small snag I see
is that we need to keep enough info in the fsnotify mark so that the mark
can be associated with an inode when it is read from disk. And this info is
filesystem-specific, with uncertain size for filesystems which use iget5().
So I suspect we'll need some support from individual filesystems, which is
always tedious.
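
For illustration, the ACL-style caching pattern transplanted onto mark
lookup might look roughly like this (the field and lookup helper below are
invented; only the sentinel trick mirrors get_cached_acl()):

/* inode->i_marks_sketch is initialized to this at inode allocation */
#define MARKS_NOT_CACHED ((struct fsnotify_mark_connector *)-1)

static struct fsnotify_mark_connector *
sketch_get_marks(struct inode *inode)
{
	struct fsnotify_mark_connector *conn;

	conn = READ_ONCE(inode->i_marks_sketch);
	if (conn != MARKS_NOT_CACHED)
		return conn;	/* cached: marks, or NULL for "none" */

	/*
	 * First event on this inode since it was instantiated: look up a
	 * persistent mark in the backend store, keyed by something the
	 * filesystem can regenerate - the iget5() snag mentioned above.
	 */
	conn = sketch_lookup_persistent_mark(inode);
	WRITE_ONCE(inode->i_marks_sketch, conn);
	return conn;
}
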

> At this point, the fsnotify inode marks can all be removed from the
> inode when it is being evicted and there's no need for fsnotify to
> pin inodes at all.
> 
> > > > > Then at umount time, the dentry shrinking will deal with all live
> > > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > > just the root dentry inodes?
> > > >
> > > > I don't think even that is necessary, because
> > > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > > inode references, roots included.
> > > >
> > > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > > use of the nasty s_inodes list?
> > > >
> > > > Yes, it would, but someone who knows exactly when the fsnotify
> > > > marks can be removed needs to chime in here...
> > 
> > So fsnotify needs a list of inodes for the superblock which have marks
> > attached and for which we hold an inode reference. We can keep it inside
> > fsnotify code although it would practically mean another list_head for
> > this list (probably in our fsnotify_mark_connector structure, which
> > connects the list of notification marks to the inode).
> 
> I don't think that is necessary. We need to get rid of the inode
> reference, not move where we track inode references. The persistent
> object is the fsnotify mark, not the cached inode. It's the mark
> that needs to be persistent, and that's what the fsnotify code
> should be tracking.

Right, I was not precise here. We don't need a list of tracked inodes. We
are fine with a list of all marks for inodes on a superblock which we could
crawl on umount.
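
For concreteness, a rough shape of that crawl (all names invented;
spin_lock() and list_for_each_entry_safe() are the real primitives):

struct sb_marks_sketch {
	spinlock_t lock;
	struct list_head marks;		/* every inode mark on this sb */
};

struct mark_sketch {
	struct list_head sb_list;	/* links into sb_marks_sketch.marks */
};

static void sketch_fsnotify_sb_umount(struct sb_marks_sketch *info)
{
	struct mark_sketch *mark, *next;

	spin_lock(&info->lock);
	list_for_each_entry_safe(mark, next, &info->marks, sb_list) {
		/* deliver IN_UNMOUNT / IN_IGNORED via the mark's group and
		 * tear the mark down - no inode needs to be in cache */
		sketch_send_unmount_event(mark);
		sketch_destroy_mark(mark);
	}
	spin_unlock(&info->lock);
}
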

> The fsnotify marks are much smaller than inodes, and there are going to
> be fewer cached marks than inodes, especially once inode pinning is
> removed. Hence I think this will result in a net reduction in memory
> footprint for "marked-until-unmount" configurations as we won't pin
> nearly as many inodes in cache...

I agree. If fsnotify marks stop pinning inodes, we'll probably win much
more memory by keeping inodes reclaimable than we lose through the extra
overhead of mark tracking.

> > > > > And I wonder if the quota code (which uses the s_inodes list
> > > > > to enable quotas on already mounted filesystems) could for
> > > > > all the same reasons just walk the dentry tree instead (and
> > > > > remove_dquot_ref similarly could just remove it at
> > > > > dentry_unlink_inode() time)?
> > > >
> > > > I don't think that will work because we have to be able to
> > > > modify quota in evict() processing. This is especially true
> > > > for unlinked inodes being evicted from cache, but also the
> > > > dquots need to stay attached until writeback completes.
> > > >
> > > > Hence I don't think we can remove the quota refs from the
> > > > inode before we call iput_final(), and so I think quotaoff (at
> > > > least) still needs to iterate inodes...
> > 
> > Yeah, I'm not sure how to get rid of the s_inodes use in quota
> > code. One of the things we need the s_inodes list for is quotaoff
> > on a mounted filesystem, when we need to iterate all inodes which
> > are referencing quota structures and free them. In theory we could
> > keep a list of inodes referencing quota structures but that would
> > require adding a list_head to the inode structure for filesystems
> > that support quotas.
> 
> I don't think that's quite true. Quota is not modular, so we can
> lazily free quota objects even when quota is turned off. All we need
> to ensure is that the code checks whether quota is enabled, not
> whether quota objects are attached to the inode.
> 
> Hence quota-off simply turns off all the quota operations in memory,
> and normal inode eviction cleans up the stale quota objects
> naturally.

Ho, hum, possibly yes. I need to think a bit more about this.
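
For what it's worth, the lazy model could look roughly like this
(sb_has_quota_active() and dquot_drop() are existing helpers; the
sketch_* names are invented):

static void sketch_quota_off(struct super_block *sb, int type)
{
	/* just flip the enabled state - no s_inodes walk, no
	 * remove_dquot_ref(); stale dquot pointers on cached inodes
	 * become inert */
	sketch_clear_quota_active(sb, type);
}

static bool sketch_quota_charge_allowed(struct inode *inode, int type)
{
	/* accounting paths key off the enabled state, never off the
	 * presence of a dquot attached to the inode */
	return sb_has_quota_active(inode->i_sb, type);
}

static void sketch_evict_inode(struct inode *inode)
{
	/* eviction lazily drops whatever dquots are still attached */
	dquot_drop(inode);
}
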

> My main question is why the quota-on add_dquot_ref() pass is
> required. AFAICT all of the filesystem operations that will modify
> quota call dquot_initialize() directly to attach the required dquots
> to the inode before the operation is started. If that's true, then
> why does quota-on need to do this for all the inodes that are
> already in cache?

This is again for handling quotaon on an already mounted filesystem. We
initialize quotas for an inode when opening a file, so if some files are
already open when we do quotaon, we want to attach quota structures to
these inodes. I think this was kind of important to limit the mismatch
between real usage and accounted usage when old-style quotas were used,
e.g. for the root filesystem, but to be fair this code was already there
when I became quota maintainer in 1999 and I never dared to remove it :)
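
For context, this is roughly the existing open-time hook in
fs/quota/dquot.c. Files that were already open for write when quotaon runs
never went through it, which is what the add_dquot_ref() s_inodes walk
compensates for:

int dquot_file_open(struct inode *inode, struct file *file)
{
	int error;

	error = generic_file_open(inode, file);
	if (!error && (file->f_mode & FMODE_WRITE))
		error = dquot_initialize(inode);
	return error;
}
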

> > Now for the sake of
> > full context I'll also say that enabling / disabling quotas on a mounted
> > filesystem is a legacy feature because it is quite easy for quota
> > accounting to go wrong with it. So ext4 and f2fs have supported, for
> > quite a few years, a mode where quota tracking is enabled on mount and
> > disabled on unmount (if the appropriate fs feature is enabled) and you
> > can only enable / disable enforcement of quota limits during runtime.
> 
> Sure, this is how XFS works, too. But I think this behaviour is
> largely irrelevant because there are still filesystems out there
> that do stuff the old way...
> 
> > So I could see us
> > deprecating this functionality altogether, although jfs never adapted to
> > this new way we do quotas so we'd have to deal with that somehow. But one
> > way or another it would take a significant amount of time before we can
> > completely remove this, so it is out of the question for this series.
> 
> I'm not sure that matters, though it adds to the reasons why we
> should be removing old, unmaintained filesystems from the tree
> and old, outdated formats from maintained filesystems....

True.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR



