Re: lsm sb_delete hook, was Re: [PATCH 4/7] vfs: Convert sb->s_inodes iteration to super_iter_inodes()

Amir Goldstein <amir73il@xxxxxxxxx> · Wed, 9 Oct 2024 08:10:30 +0200

On Wed, Oct 9, 2024 at 1:44 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Oct 08, 2024 at 01:23:44PM +0200, Jan Kara wrote:
> > On Tue 08-10-24 10:57:22, Amir Goldstein wrote:
> > > On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote:
> > > > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@xxxxxxx> wrote:
> > > > > >
> > > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in
> > > > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and
> > > > > > > Landlock's hook_sb_delete() into a single iteration relatively easily. But
> > > > > > > I'd wait with that conversion until this series lands.
> > > > >
> > > > > Honza, I looked at this a bit more, particularly with an eye of "what
> > > > > happens if we just end up making the inode lifetimes subject to the
> > > > > dentry lifetimes" as suggested by Dave elsewhere.
> > > >
> > > > ....
> > > >
> > > > > which makes the fsnotify_inode_delete() happen when the inode is
> > > > > removed from the dentry.
> > > >
> > > > There may be other inode references being held that make
> > > > the inode live longer than the dentry cache. When should the
> > > > fsnotify marks be removed from the inode in that case? Do they need
> > > > to remain until, e.g., writeback completes?
> > > >
> > >
> > > fsnotify inode marks remain until explicitly removed or until sb
> > > is unmounted (*), so other inode references are irrelevant to
> > > inode mark removal.
> > >
> > > (*) fanotify has "evictable" inode marks, which do not hold inode
> > > reference and go away on inode evict, but those mark evictions
> > > do not generate any event (i.e. there is no FAN_UNMOUNT).
> >
> > Yes. Amir beat me to the response, so let me just add that the FS_UNMOUNT
> > event is for inotify, which guarantees that if you place a mark on some
> > inode, you get either an event about somebody unlinking the inode (e.g.
> > IN_DELETE_SELF) or an event about the filesystem being unmounted
> > (IN_UNMOUNT). I also don't see how we would maintain this behavior with
> > what Linus proposes.
>
> Thanks. I didn't respond last night when I read Amir's description
> because I wanted to think it over. Knowing where the unmount event
> requirement comes from certainly helps.
>
> I am probably missing something important, but it really seems to me
> that the object reference counting model is back to
> front.  Currently the mark is being attached to the inode and then
> the inode pinned by a reference count to make the mark attached
> to the inode persistent until unmount. This then requires the inodes
> to be swept by unmount because fsnotify has effectively leaked them
> as it isn't tracking such inodes itself.
>
> [ Keep in mind that I'm not saying this was a bad or wrong thing to
> do because the s_inodes list was there to be able to do this sort of
> lazy cleanup. But now that we want to remove the s_inodes list if at
> all possible, it is a problem we need to solve differently. ]
>
> AFAICT, inotify does not appear to require the inode to send events
> - it only requires access to the inode mark itself. Hence it does
> not need the inode in cache to generate IN_UNMOUNT events, it just
> needs the mark itself to be findable at unmount.  Do any of the
> other backends that require unmount notifications also require
> special access to the inode itself?
>

No other backend supports IN_UNMOUNT/FS_UNMOUNT.
We want to add unmount events support to fanotify, but those are
only going to be possible for watching a mount or an sb, not inodes.

> If not, and the fsnotify sb info is tracking these persistent marks,
> then we don't need to iterate inodes at unmount. This means we don't
> need to pin inodes when they have marks attached, and so the
> dependency on the s_inodes list goes away.
>
> With this inverted model, we need the first fsnotify event callout
> after the inode is instantiated to look for a persistent mark for
> the inode. We know how to do this efficiently - it's exactly the
> same caching model we use for ACLs. On the first lookup, we check
> the inode for ACL data and set the ACL pointer appropriately to
> indicate that a lookup has been done and there are no ACLs
> associated with the inode.
>
> At this point, the fsnotify inode marks can all be removed from the
> inode when it is being evicted and there's no need for fsnotify to
> pin inodes at all.
>
> > > > > Then at umount time, the dentry shrinking will deal with all live
> > > > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to
> > > > > just the root dentry inodes?
> > > >
> > > > I don't think even that is necessary, because
> > > > shrink_dcache_for_umount() drops the sb->s_root dentry after
> > > > trimming the dentry tree. Hence the dcache drop would cleanup all
> > > > inode references, roots included.
> > > >
> > > > > Wouldn't that make things much cleaner, and remove at least *one* odd
> > > > > use of the nasty s_inodes list?
> > > >
> > > > Yes, it would, but someone who knows exactly when the fsnotify
> > > > marks can be removed needs to chime in here...
> >
> > So fsnotify needs a list of inodes for the superblock which have marks
> > attached and for which we hold inode reference. We can keep it inside
> > fsnotify code although it would practically mean another list_head for the
> > inode for this list (probably in our fsnotify_connector structure which
> > connects list of notification marks to the inode).
>
> I don't think that is necessary. We need to get rid of the inode
> reference, not move where we track inode references. The persistent
> object is the fsnotify mark, not the cached inode. It's the mark
> that needs to be persistent, and that's what the fsnotify code
> should be tracking.
>
> The fsnotify marks are much smaller than inodes, and there are going to
> be fewer cached marks than inodes, especially once inode pinning is
> removed. Hence I think this will result in a net reduction in memory
> footprint for "marked-until-unmount" configurations as we won't pin
> nearly as many inodes in cache...
>

It is a feasible design which has all the benefits that you listed.
But it is a big change just to get away from s_inodes
(it is much easier to maintain a private list of pinned inodes).

inotify (recursive tree watches, for that matter) has been
inefficient that way for a long time, and users now have less
memory-hogging solutions like fanotify mount and sb marks.
Granted, not unprivileged users, but still.

So there needs to be a good justification to make this design change.
One such justification would be to provide the infrastructure to
the feature that Jan referred to as the "holy grail" in his LPC talk,
namely, subtree watches.

If we introduce code that looks up persistent "mark rules" on
inode instantiation, then we could use it to "reconnect" inotify
persistent inode marks (by ino/fid) or to establish automatic
marks based on subtree/path based rules.
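
To make the "look up once, cache the result on the inode" idea a bit more
concrete, here is a rough userspace sketch. It is not kernel code: every
name in it (struct mark, struct sb_marks, struct inode_stub,
MARK_NOT_CACHED and the helpers) is made up for illustration, and the
linear table stands in for whatever per-sb index fsnotify would really
keep. The sentinel only mimics the spirit of the ACL caching Dave
describes. The point is that the persistent object lives in the sb
table, the inode only holds a cached pointer, and eviction or unmount
never need to pin or walk inodes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical persistent mark, keyed by inode number. */
struct mark {
	uint64_t ino;
	uint32_t mask;
};

/* Sentinel for "no lookup done yet" (assumption: mirrors the ACL idea). */
#define MARK_NOT_CACHED ((struct mark *)(intptr_t)-1)

/* Hypothetical per-sb table that owns the persistent marks. */
struct sb_marks {
	struct mark *tbl;
	size_t n;
	unsigned long lookups;	/* counts slow-path lookups */
};

/* Stand-in for an in-core inode; holds only a cached pointer. */
struct inode_stub {
	uint64_t ino;
	struct mark *cached;
};

static void inode_stub_init(struct inode_stub *inode, uint64_t ino)
{
	inode->ino = ino;
	inode->cached = MARK_NOT_CACHED;
}

/* Slow path: search the sb table for a mark matching this ino. */
static struct mark *sb_lookup_mark(struct sb_marks *sb, uint64_t ino)
{
	sb->lookups++;
	for (size_t i = 0; i < sb->n; i++)
		if (sb->tbl[i].ino == ino)
			return &sb->tbl[i];
	return NULL;
}

/* First event after instantiation does the lookup; NULL is cached too. */
static struct mark *inode_get_mark(struct sb_marks *sb,
				   struct inode_stub *inode)
{
	assert(inode != NULL);
	if (inode->cached == MARK_NOT_CACHED)
		inode->cached = sb_lookup_mark(sb, inode->ino);
	return inode->cached;
}

/* Eviction drops only the cached pointer; the mark stays in the table. */
static void inode_evict(struct inode_stub *inode)
{
	inode->cached = MARK_NOT_CACHED;
}

/* Unmount walks marks, not inodes; returns how many would be notified. */
static size_t sb_unmount_notify(const struct sb_marks *sb)
{
	return sb->n;
}
```

In this toy model an inode with no mark costs one table lookup per
instantiation and nothing afterwards, and a re-instantiated inode
"reconnects" to its mark on the first event.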

The audit code has something that resembles this, and I suspect that
Landlock is doing something similar (?), but I didn't check.
Path-based rules are always going to be elusive and tricky, and
Al is always going to hate them ;)

Bottom line - good idea, not easy, requires allocating development resources.

Thanks,
Amir.