On Tue, Oct 8, 2024 at 1:33 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > On Mon, Oct 07, 2024 at 01:37:19PM -0700, Linus Torvalds wrote: > > On Thu, 3 Oct 2024 at 04:57, Jan Kara <jack@xxxxxxx> wrote: > > > > > > Fair enough. If we go with the iterator variant I've suggested to Dave in > > > [1], we could combine the evict_inodes(), fsnotify_unmount_inodes() and > > > Landlocks hook_sb_delete() into a single iteration relatively easily. But > > > I'd wait with that convertion until this series lands. > > > > Honza, I looked at this a bit more, particularly with an eye of "what > > happens if we just end up making the inode lifetimes subject to the > > dentry lifetimes" as suggested by Dave elsewhere. > > .... > > > which makes the fsnotify_inode_delete() happen when the inode is > > removed from the dentry. > > There may be other inode references being held that make > the inode live longer than the dentry cache. When should the > fsnotify marks be removed from the inode in that case? Do they need > to remain until, e.g, writeback completes? > fsnotify inode marks remain until explicitly removed or until sb is unmounted (*), so other inode references are irrelevant to inode mark removal. (*) fanotify has "evictable" inode marks, which do not hold inode reference and go away on inode evict, but those mark evictions do not generate any event (i.e. there is no FAN_UNMOUNT). > > Then at umount time, the dentry shrinking will deal with all live > > dentries, and at most the fsnotify layer would send the FS_UNMOUNT to > > just the root dentry inodes? > > I don't think even that is necessary, because > shrink_dcache_for_umount() drops the sb->s_root dentry after > trimming the dentry tree. Hence the dcache drop would cleanup all > inode references, roots included. > > > Wouldn't that make things much cleaner, and remove at least *one* odd > > use of the nasty s_inodes list? > > Yes, it would, but someone who knows exactly when the fsnotify > marks can be removed needs to chime in here... > > > I have this feeling that maybe we can just remove the other users too > > using similar models. I think the LSM layer use (in landlock) is bogus > > for exactly the same reason - there's really no reason to keep things > > around for a random cached inode without a dentry. > > Perhaps, but I'm not sure what the landlock code is actually trying > to do. It seems to be trying to avoid races between syscalls > releasing inode references and unmount calling security_sb_delete() > to clean up inode references that it has leaked. This implies that > it's not a) tracking inodes itself, and b) not cleaning up internal > state early enough in unmount. > > Hence, to me, the lifecycle and reference counting of inode related > objects in landlock doesn't seem quite right, and the use of the > security_sb_delete() callout appears to be papering over an internal > lifecycle issue. > > I'd love to get rid of it altogether. > > > And I wonder if the quota code (which uses the s_inodes list to enable > > quotas on already mounted filesystems) could for all the same reasons > > just walk the dentry tree instead (and remove_dquot_ref similarly > > could just remove it at dentry_unlink_inode() time)? > > I don't think that will work because we have to be able to modify > quota in evict() processing. This is especially true for unlinked > inodes being evicted from cache, but also the dquots need to stay > attached until writeback completes. > > Hence I don't think we can remove the quota refs from the inode > before we call iput_final(), and so I think quotaoff (at least) > still needs to iterate inodes... > > > It really feels like most (all?) of the s_inode list users are > > basically historical, and shouldn't use that list at all. And there > > aren't _that_ many of them. I think Dave was right in just saying that > > this list should go away entirely (or was it somebody else who made > > that comment?) > > Yeah, I said that it should go away entirely. > > My view of this whole s_inodes list is that subsystems that are > taking references to inodes *must* track or manage the references to > the inodes themselves. > > The canonical example is the VFS itself: evict_inodes() doesn't need > to iterate s_inodes at all. It can walk the inode LRU to purge all > the unreferenced cached inodes from memory. iput_final() guarantees > that all unreferenced inodes are either put on the LRU or torn down > immediately. > > Hence I think that it is a poor architectural decision to require > superblock teardown to clean up inode references random subsystems > have *leaked* to prevent UAFs. It forces the sb to track all > inodes whether the VFS actually needs to track them or not. > For fsnotify, I think we can/should maintain a list of marked inodes inside sb->s_fsnotify_info, we can iterate this private list in fsnotify_unmount_inodes() to remove the marks. TBH, I am not sure I understand the suggested change for inode lifetime. An inode can have a reference from dentry or from some subsystem (e.g. fsnotify) which is responsible for putting their held reference before unmount. What is the alternative? Thanks, Amir.