Hi Linus, After the recent controversy over the S_KERNEL_FILE flag, I wonder if it should be split into two different flags with different functions and have more appropriate names given to reflect this to make its function easier to understand. I've put in some not particulary great suggestions as to the naming, but something better might suggest itself to others. I've also left the flags generic as there's nothing specifically about them that means they couldn't be used by other kernel drivers too. Hopefully, I've also done a better job of explaining the issues (including a couple of races) the flags fix in the attached patch. Amir's suggestion of moving the no-remove check into may_delete() unfortunately doesn't work because the flag forbidding the VFS op (eg. rmdir) needs to be done with the lock held, to stop the flag from being changed until the VFS op is complete. David --- Split S_KERNEL_FILE into two separate flags to do two separate jobs and give them new names[1][2]: (1) S_INUSE. This allows a kernel service to mark an inode as in-use by that kernel service. This is then used by cachefiles to indicate that it is using a file or directory. Cachefiles uses this for three things: (a) Defending against double-use of part of the cache due to bugs, races or misconfiguration - which could lead to data corruption. (b) As a fast way to tell whether a file is in use when cachefilesd asks to cull it (culling is offloaded to userspace). Previously, the inode details were used to look up in a big tree of records - but doing the same job with a single bit is less expensive. (c) To stop a file that we've agreed cachefilesd may cull from being reused by the cache. This fixes a race between the cull request and the object being buried because we have to drop the inode locks that we've taken so that we can call functions like vfs_unlink(). The race gives a window in which the object can get looked up again - but if the file is in a cullable state, there's nowhere to make a note that it needs to be got rid of except the backing inode. (2) S_NOREMOVE. If this is set, the file or directory may not be removed, renamed or unlinked. This can then be used by cachefiles to prevent userspace removing or renaming files and directories that the are being used. The directory layout in its cache is very important to cachefiles as the names are how it indexes the contents. Removing objects can cause cachefilesd to malfunction as it may find it can't reach bits of the structure that it previously created and still has dentry pointers to. This also closes a race between cachefilesd trying to cull an empty directory and cachefiles trying to create something in it. Amir Goldstein suggested that the check in vfs_rmdir() should be moved to may_delete()[1], but it really needs to be done whilst the inode lock is held. Both of these flags can only be changed under the inode lock. I would recommend that S_NOREMOVE only be set/cleared if S_INUSE is first set. Note that potential usage of these flags is by no means limited to cachefiles, which is why they've been left as general. S_INUSE, for example, if set by one kernel service, will cause another kernel service that uses the flag in the same way to abort that use. Note also that S_NOREMOVE will prevent vfs_unlink() vfs_rmdir() and vfs_rename() from operating on a file. Also define IS_xxx() macros for the above flags[3]. Signed-off-by: David Howells <dhowells@xxxxxxxxxx> cc: linux-cachefs@xxxxxxxxxx Link: https://lore.kernel.org/r/CAOQ4uxjEcvffv=rNXS-r+NLz+=6yk4abRuX_AMq9v-M4nf_PtA@xxxxxxxxxxxxxx [1] Link: https://lore.kernel.org/r/Ydvl8Dk8z0mF0KFl@xxxxxxxxxxxxx/ [2] Link: https://lore.kernel.org/r/88d7f8970dcc0fd0ead891b1f42f160b8d17d60e.camel@xxxxxxxxxx/ [3] --- fs/cachefiles/namei.c | 18 ++++++++++-------- fs/namei.c | 8 +++++--- include/linux/fs.h | 5 ++++- 3 files changed, 19 insertions(+), 12 deletions(-) diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c index f256c8aff7bb..30b7b71158c4 100644 --- a/fs/cachefiles/namei.c +++ b/fs/cachefiles/namei.c @@ -20,8 +20,8 @@ static bool __cachefiles_mark_inode_in_use(struct cachefiles_object *object, struct inode *inode = d_backing_inode(dentry); bool can_use = false; - if (!(inode->i_flags & S_KERNEL_FILE)) { - inode->i_flags |= S_KERNEL_FILE; + if (!IS_INUSE(inode)) { + inode->i_flags |= S_INUSE | S_NOREMOVE; trace_cachefiles_mark_active(object, inode); can_use = true; } else { @@ -53,7 +53,7 @@ static void __cachefiles_unmark_inode_in_use(struct cachefiles_object *object, { struct inode *inode = d_backing_inode(dentry); - inode->i_flags &= ~S_KERNEL_FILE; + inode->i_flags &= ~(S_INUSE | S_NOREMOVE); trace_cachefiles_mark_inactive(object, inode); } @@ -392,8 +392,10 @@ int cachefiles_bury_object(struct cachefiles_cache *cache, }; trace_cachefiles_rename(object, d_inode(rep)->i_ino, why); ret = cachefiles_inject_read_error(); - if (ret == 0) + if (ret == 0) { + __cachefiles_unmark_inode_in_use(object, rep); ret = vfs_rename(&rd); + } if (ret != 0) trace_cachefiles_vfs_error(object, d_inode(dir), ret, cachefiles_trace_rename_error); @@ -402,7 +404,6 @@ int cachefiles_bury_object(struct cachefiles_cache *cache, "Rename failed with error %d", ret); } - __cachefiles_unmark_inode_in_use(object, rep); unlock_rename(cache->graveyard, dir); dput(grave); _leave(" = 0"); @@ -426,6 +427,7 @@ int cachefiles_delete_object(struct cachefiles_object *object, dget(dentry); inode_lock_nested(d_backing_inode(fan), I_MUTEX_PARENT); + cachefiles_unmark_inode_in_use(object, object->file); ret = cachefiles_unlink(volume->cache, object, fan, dentry, why); inode_unlock(d_backing_inode(fan)); dput(dentry); @@ -746,7 +748,7 @@ static struct dentry *cachefiles_lookup_for_cull(struct cachefiles_cache *cache, goto lookup_error; if (d_is_negative(victim)) goto lookup_put; - if (d_inode(victim)->i_flags & S_KERNEL_FILE) + if (IS_INUSE(d_inode(victim))) goto lookup_busy; return victim; @@ -793,11 +795,11 @@ int cachefiles_cull(struct cachefiles_cache *cache, struct dentry *dir, /* check to see if someone is using this object */ inode = d_inode(victim); inode_lock(inode); - if (inode->i_flags & S_KERNEL_FILE) { + if (IS_INUSE(inode)) { ret = -EBUSY; } else { /* Stop the cache from picking it back up */ - inode->i_flags |= S_KERNEL_FILE; + inode->i_flags |= S_INUSE; ret = 0; } inode_unlock(inode); diff --git a/fs/namei.c b/fs/namei.c index d81f04f8d818..99eddc41f6aa 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3959,7 +3959,7 @@ int vfs_rmdir(struct user_namespace *mnt_userns, struct inode *dir, error = -EBUSY; if (is_local_mountpoint(dentry) || - (dentry->d_inode->i_flags & S_KERNEL_FILE)) + IS_NOREMOVE(dentry->d_inode)) goto out; error = security_inode_rmdir(dir, dentry); @@ -4090,7 +4090,8 @@ int vfs_unlink(struct user_namespace *mnt_userns, struct inode *dir, inode_lock(target); if (IS_SWAPFILE(target)) error = -EPERM; - else if (is_local_mountpoint(dentry)) + else if (is_local_mountpoint(dentry) || + IS_NOREMOVE(dentry->d_inode)) error = -EBUSY; else { error = security_inode_unlink(dir, dentry); @@ -4603,7 +4604,8 @@ int vfs_rename(struct renamedata *rd) goto out; error = -EBUSY; - if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry)) + if (is_local_mountpoint(old_dentry) || is_local_mountpoint(new_dentry) || + IS_NOREMOVE(old_dentry->d_inode)) goto out; if (max_links && new_dir != old_dir) { diff --git a/include/linux/fs.h b/include/linux/fs.h index f5d3bf5b69a6..68cae4aaa6ff 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2173,7 +2173,8 @@ struct super_operations { #define S_ENCRYPTED (1 << 14) /* Encrypted file (using fs/crypto/) */ #define S_CASEFOLD (1 << 15) /* Casefolded file */ #define S_VERITY (1 << 16) /* Verity file (using fs/verity/) */ -#define S_KERNEL_FILE (1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */ +#define S_INUSE (1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */ +#define S_NOREMOVE (1 << 18) /* File is not to be removed or renamed */ /* * Note that nosuid etc flags are inode-specific: setting some file-system @@ -2216,6 +2217,8 @@ static inline bool sb_rdonly(const struct super_block *sb) { return sb->s_flags #define IS_ENCRYPTED(inode) ((inode)->i_flags & S_ENCRYPTED) #define IS_CASEFOLDED(inode) ((inode)->i_flags & S_CASEFOLD) #define IS_VERITY(inode) ((inode)->i_flags & S_VERITY) +#define IS_INUSE(inode) ((inode)->i_flags & S_INUSE) +#define IS_NOREMOVE(inode) ((inode)->i_flags & S_NOREMOVE) #define IS_WHITEOUT(inode) (S_ISCHR(inode->i_mode) && \ (inode)->i_rdev == WHITEOUT_DEV)