On Sat, 5 Aug 2023 at 11:47, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > Yes. Except I hate having two of those iterate functions. We should > have gotten rid of them absolutely years ago. > > You shamed me into fixing that. Ok, final build test done and extensive 'git grep' to make sure I didn't miss anything. Here's two patches to finally get rid of the old legacy '->iterate()' case entirely from the file ops structure. Yes, some filesystems then still get the inode lock in write mode, but now it's the filesystem itself that wraps its own iterator, rather than cause pain for the callers. So no more "Do you have an ->iterate() _or_ ->iterate_shared() function?" and associated "I need to do locking differently" nastiness. Only odd filesystems that never got the memo on "don't use .iterate". I looked at some of the filesystems I added the wrapper to, and they looked like maybe they would be perfectly happy to just iterate with the lock held shared. But this really was meant to be a completely mindless conversion with no semantic changes. Adding some filesystem maintainers where that second patch adds the wrapper use. It would be lovely if in another seven years we could get rid of that wrapper too ;) The patches are entirely untested, but they do build cleanly for me, and look very straightforward. The wrapper function is trivial, and is mostly comments. Anybody? Linus
From f0c4ccfd179c301623151c49694e1061ca3f2ed2 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> Date: Sat, 5 Aug 2023 10:49:31 -0700 Subject: [PATCH 1/2] proc: fix missing conversion to 'iterate_shared' I'm looking at the directory handling due to the discussion about f_pos locking (see commit 797964253d35: "file: reinstate f_pos locking optimization for regular files"), and wanting to clean that up. And one source of ugliness is how we were supposed to move filesystems over to the '->iterate_shared()' function that only takes the inode lock for reading many many years ago, but several filesystems still use the bad old '->iterate()' that takes the inode lock for exclusive access. See commit 6192269444eb ("introduce a parallel variant of ->iterate()") that also added some documentation stating Old method is only used if the new one is absent; eventually it will be removed. Switch while you still can; the old one won't stay. and that was back in April 2016. Here we are, many years later, and the old version is still clearly sadly alive and well. Now, some of those old style iterators are probably just because the filesystem may end up having per-inode mutable data that it uses for iterating a directory, but at least one case is just a mistake. Al switched over most filesystems to use '->iterate_shared()' back when it was introduced. In particular, the /proc filesystem was converted as one of the first ones in commit f50752eaa0b0 ("switch all procfs directories ->iterate_shared()"). But then later one new user of '->iterate()' was then re-introduced by commit 6d9c939dbe4d ("procfs: add smack subdir to attrs"). And that's clearly not what we wanted, since that new case just uses the same 'proc_pident_readdir()' and 'proc_pident_lookup()' helper functions that other /proc pident directories use, and they are most definitely safe to use with the inode lock held shared. So just fix it. This still leaves a fair number of oddball filesystems using the old-style directory iterator (ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf), but at least we don't have any remaining in the core filesystems. I'm going to add a wrapper function that just drops the read-lock and takes it as a write lock, so that we can clean up the core vfs layer and make all the ugly 'this filesystem needs exclusive inode locking' be just filesystem-internal warts. I just didn't want to make that conversion when we still had a core user left. Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> --- fs/proc/base.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 05452c3b9872..9df3f4839662 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2817,7 +2817,7 @@ static int proc_##LSM##_attr_dir_iterate(struct file *filp, \ \ static const struct file_operations proc_##LSM##_attr_dir_ops = { \ .read = generic_read_dir, \ - .iterate = proc_##LSM##_attr_dir_iterate, \ + .iterate_shared = proc_##LSM##_attr_dir_iterate, \ .llseek = default_llseek, \ }; \ \ -- 2.41.0.203.ga4f2cd32bb.dirty
From f8d841c7b183e24148462cc0cca4a76c74af2aa1 Mon Sep 17 00:00:00 2001 From: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> Date: Sat, 5 Aug 2023 12:25:01 -0700 Subject: [PATCH 2/2] vfs: get rid of old '->iterate' directory operation All users now just use '->iterate_shared()', which only takes the directory inode lock for reading. Filesystems that never got convered to shared mode now instead use a wrapper that drops the lock, re-takes it in write mode, calls the old function, and then downgrades the lock back to read mode. This way the VFS layer and other callers no longer need to care about filesystems that never got converted to the modern era. The filesystems that use the new wrapper are ceph, coda, exfat, jfs, ntfs, ocfs2, overlayfs, and vboxsf. Honestly, several of them look like they really could just iterate their directories in shared mode and skip the wrapper entirely, but the point of this change is to not change semantics or fix filesystems that haven't been fixed in the last 7+ years, but to finally get rid of the dual iterators. Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> --- Documentation/filesystems/locking.rst | 5 +- Documentation/filesystems/porting.rst | 25 ++++------ fs/ceph/dir.c | 5 +- fs/coda/dir.c | 20 +++----- fs/exfat/dir.c | 3 +- fs/exportfs/expfs.c | 2 +- fs/jfs/namei.c | 3 +- fs/ntfs/dir.c | 3 +- fs/ocfs2/file.c | 5 +- fs/overlayfs/readdir.c | 3 +- fs/readdir.c | 68 ++++++++++++++++++++------- fs/vboxsf/dir.c | 3 +- include/linux/fs.h | 8 +++- 13 files changed, 95 insertions(+), 58 deletions(-) diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst index ed148919e11a..0ca479dbb1cd 100644 --- a/Documentation/filesystems/locking.rst +++ b/Documentation/filesystems/locking.rst @@ -551,9 +551,8 @@ mutex or just to use i_size_read() instead. Note: this does not protect the file->f_pos against concurrent modifications since this is something the userspace has to take care about. -->iterate() is called with i_rwsem exclusive. - -->iterate_shared() is called with i_rwsem at least shared. +->iterate_shared() is called with i_rwsem held for reading, and with the +file f_pos_lock held exclusively ->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags. Most instances call fasync_helper(), which does that maintenance, so it's diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index d2d684ae7798..0f5da78ef4f9 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -537,7 +537,7 @@ vfs_readdir() is gone; switch to iterate_dir() instead **mandatory** -->readdir() is gone now; switch to ->iterate() +->readdir() is gone now; switch to ->iterate_shared() **mandatory** @@ -693,24 +693,19 @@ parallel now. --- -**recommended** +**mandatory** -->iterate_shared() is added; it's a parallel variant of ->iterate(). +->iterate_shared() is added. Exclusion on struct file level is still provided (as well as that between it and lseek on the same struct file), but if your directory has been opened several times, you can get these called in parallel. Exclusion between that method and all directory-modifying ones is still provided, of course. -Often enough ->iterate() can serve as ->iterate_shared() without any -changes - it is a read-only operation, after all. If you have any -per-inode or per-dentry in-core data structures modified by ->iterate(), -you might need something to serialize the access to them. If you -do dcache pre-seeding, you'll need to switch to d_alloc_parallel() for -that; look for in-tree examples. - -Old method is only used if the new one is absent; eventually it will -be removed. Switch while you still can; the old one won't stay. +If you have any per-inode or per-dentry in-core data structures modified +by ->iterate_shared(), you might need something to serialize the access +to them. If you do dcache pre-seeding, you'll need to switch to +d_alloc_parallel() for that; look for in-tree examples. --- @@ -930,9 +925,9 @@ should be done by looking at FMODE_LSEEK in file->f_mode. filldir_t (readdir callbacks) calling conventions have changed. Instead of returning 0 or -E... it returns bool now. false means "no more" (as -E... used to) and true - "keep going" (as 0 in old calling conventions). Rationale: -callers never looked at specific -E... values anyway. ->iterate() and -->iterate_shared() instance require no changes at all, all filldir_t ones in -the tree converted. +callers never looked at specific -E... values anyway. -> iterate_shared() +instances require no changes at all, all filldir_t ones in the tree +converted. --- diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index 4a2b39d9a61a..bdcffb04513f 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -2019,9 +2019,10 @@ unsigned ceph_dentry_hash(struct inode *dir, struct dentry *dn) } } +WRAP_DIR_ITER(ceph_readdir) // FIXME! const struct file_operations ceph_dir_fops = { .read = ceph_read_dir, - .iterate = ceph_readdir, + .iterate_shared = shared_ceph_readdir, .llseek = ceph_dir_llseek, .open = ceph_open, .release = ceph_release, @@ -2033,7 +2034,7 @@ const struct file_operations ceph_dir_fops = { }; const struct file_operations ceph_snapdir_fops = { - .iterate = ceph_readdir, + .iterate_shared = shared_ceph_readdir, .llseek = ceph_dir_llseek, .open = ceph_open, .release = ceph_release, diff --git a/fs/coda/dir.c b/fs/coda/dir.c index 8450b1bd354b..1b960de2bf39 100644 --- a/fs/coda/dir.c +++ b/fs/coda/dir.c @@ -429,21 +429,14 @@ static int coda_readdir(struct file *coda_file, struct dir_context *ctx) cfi = coda_ftoc(coda_file); host_file = cfi->cfi_container; - if (host_file->f_op->iterate || host_file->f_op->iterate_shared) { + if (host_file->f_op->iterate_shared) { struct inode *host_inode = file_inode(host_file); ret = -ENOENT; if (!IS_DEADDIR(host_inode)) { - if (host_file->f_op->iterate_shared) { - inode_lock_shared(host_inode); - ret = host_file->f_op->iterate_shared(host_file, ctx); - file_accessed(host_file); - inode_unlock_shared(host_inode); - } else { - inode_lock(host_inode); - ret = host_file->f_op->iterate(host_file, ctx); - file_accessed(host_file); - inode_unlock(host_inode); - } + inode_lock_shared(host_inode); + ret = host_file->f_op->iterate_shared(host_file, ctx); + file_accessed(host_file); + inode_unlock_shared(host_inode); } return ret; } @@ -585,10 +578,11 @@ const struct inode_operations coda_dir_inode_operations = { .setattr = coda_setattr, }; +WRAP_DIR_ITER(coda_readdir) // FIXME! const struct file_operations coda_dir_operations = { .llseek = generic_file_llseek, .read = generic_read_dir, - .iterate = coda_readdir, + .iterate_shared = shared_coda_readdir, .open = coda_open, .release = coda_release, .fsync = coda_fsync, diff --git a/fs/exfat/dir.c b/fs/exfat/dir.c index 598081d0d059..e1586bba6d86 100644 --- a/fs/exfat/dir.c +++ b/fs/exfat/dir.c @@ -306,10 +306,11 @@ static int exfat_iterate(struct file *file, struct dir_context *ctx) return err; } +WRAP_DIR_ITER(exfat_iterate) // FIXME! const struct file_operations exfat_dir_operations = { .llseek = generic_file_llseek, .read = generic_read_dir, - .iterate = exfat_iterate, + .iterate_shared = shared_exfat_iterate, .unlocked_ioctl = exfat_ioctl, #ifdef CONFIG_COMPAT .compat_ioctl = exfat_compat_ioctl, diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c index 40e624cf7e92..d1dbe47c7975 100644 --- a/fs/exportfs/expfs.c +++ b/fs/exportfs/expfs.c @@ -315,7 +315,7 @@ static int get_name(const struct path *path, char *name, struct dentry *child) goto out; error = -EINVAL; - if (!file->f_op->iterate && !file->f_op->iterate_shared) + if (!file->f_op->iterate_shared) goto out_close; buffer.sequence = 0; diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c index 9b030297aa64..e98ddb2b1cf2 100644 --- a/fs/jfs/namei.c +++ b/fs/jfs/namei.c @@ -1535,9 +1535,10 @@ const struct inode_operations jfs_dir_inode_operations = { #endif }; +WRAP_DIR_ITER(jfs_readdir) // FIXME! const struct file_operations jfs_dir_operations = { .read = generic_read_dir, - .iterate = jfs_readdir, + .iterate_shared = shared_jfs_readdir, .fsync = jfs_fsync, .unlocked_ioctl = jfs_ioctl, .compat_ioctl = compat_ptr_ioctl, diff --git a/fs/ntfs/dir.c b/fs/ntfs/dir.c index 518c3a21a556..4596c90e7b7c 100644 --- a/fs/ntfs/dir.c +++ b/fs/ntfs/dir.c @@ -1525,10 +1525,11 @@ static int ntfs_dir_fsync(struct file *filp, loff_t start, loff_t end, #endif /* NTFS_RW */ +WRAP_DIR_ITER(ntfs_readdir) // FIXME! const struct file_operations ntfs_dir_ops = { .llseek = generic_file_llseek, /* Seek inside directory. */ .read = generic_read_dir, /* Return -EISDIR. */ - .iterate = ntfs_readdir, /* Read directory contents. */ + .iterate_shared = shared_ntfs_readdir, /* Read directory contents. */ #ifdef NTFS_RW .fsync = ntfs_dir_fsync, /* Sync a directory to disk. */ #endif /* NTFS_RW */ diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index 91a194596552..bf2c17ea96a0 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -2793,10 +2793,11 @@ const struct file_operations ocfs2_fops = { .remap_file_range = ocfs2_remap_file_range, }; +WRAP_DIR_ITER(ocfs2_readdir) // FIXME! const struct file_operations ocfs2_dops = { .llseek = generic_file_llseek, .read = generic_read_dir, - .iterate = ocfs2_readdir, + .iterate_shared = shared_ocfs2_readdir, .fsync = ocfs2_sync_file, .release = ocfs2_dir_release, .open = ocfs2_dir_open, @@ -2842,7 +2843,7 @@ const struct file_operations ocfs2_fops_no_plocks = { const struct file_operations ocfs2_dops_no_plocks = { .llseek = generic_file_llseek, .read = generic_read_dir, - .iterate = ocfs2_readdir, + .iterate_shared = shared_ocfs2_readdir, .fsync = ocfs2_sync_file, .release = ocfs2_dir_release, .open = ocfs2_dir_open, diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c index ee5c4736480f..de39e067ae65 100644 --- a/fs/overlayfs/readdir.c +++ b/fs/overlayfs/readdir.c @@ -954,10 +954,11 @@ static int ovl_dir_open(struct inode *inode, struct file *file) return 0; } +WRAP_DIR_ITER(ovl_iterate) // FIXME! const struct file_operations ovl_dir_operations = { .read = generic_read_dir, .open = ovl_dir_open, - .iterate = ovl_iterate, + .iterate_shared = shared_ovl_iterate, .llseek = ovl_dir_llseek, .fsync = ovl_dir_fsync, .release = ovl_dir_release, diff --git a/fs/readdir.c b/fs/readdir.c index b264ce60114d..c8c46e294431 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -24,6 +24,53 @@ #include <asm/unaligned.h> +/* + * Some filesystems were never converted to '->iterate_shared()' + * and their directory iterators want the inode lock held for + * writing. This wrapper allows for converting from the shared + * semantics to the exclusive inode use. + */ +int wrap_directory_iterator(struct file *file, + struct dir_context *ctx, + int (*iter)(struct file *, struct dir_context *)) +{ + struct inode *inode = file_inode(file); + int ret; + + /* + * We'd love to have an 'inode_upgrade_trylock()' operation, + * see the comment in mmap_upgrade_trylock() in mm/memory.c. + * + * But considering this is for "filesystems that never got + * converted", it really doesn't matter. + * + * Also note that since we have to return with the lock held + * for reading, we can't use the "killable()" locking here, + * since we do need to get the lock even if we're dying. + * + * We could do the write part killably and then get the read + * lock unconditionally if it mattered, but see above on why + * this does the very simplistic conversion. + */ + up_read(&inode->i_rwsem); + down_write(&inode->i_rwsem); + + /* + * Since we dropped the inode lock, we should do the + * DEADDIR test again. See 'iterate_dir()' below. + * + * Note that we don't need to re-do the f_pos games, + * since the file must be locked wrt f_pos anyway. + */ + ret = -ENOENT; + if (!IS_DEADDIR(inode)) + ret = iter(file, ctx); + + downgrade_write(&inode->i_rwsem); + return ret; +} +EXPORT_SYMBOL(wrap_directory_iterator); + /* * Note the "unsafe_put_user() semantics: we goto a * label for errors. @@ -40,39 +87,28 @@ int iterate_dir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); - bool shared = false; int res = -ENOTDIR; - if (file->f_op->iterate_shared) - shared = true; - else if (!file->f_op->iterate) + + if (!file->f_op->iterate_shared) goto out; res = security_file_permission(file, MAY_READ); if (res) goto out; - if (shared) - res = down_read_killable(&inode->i_rwsem); - else - res = down_write_killable(&inode->i_rwsem); + res = down_read_killable(&inode->i_rwsem); if (res) goto out; res = -ENOENT; if (!IS_DEADDIR(inode)) { ctx->pos = file->f_pos; - if (shared) - res = file->f_op->iterate_shared(file, ctx); - else - res = file->f_op->iterate(file, ctx); + res = file->f_op->iterate_shared(file, ctx); file->f_pos = ctx->pos; fsnotify_access(file); file_accessed(file); } - if (shared) - inode_unlock_shared(inode); - else - inode_unlock(inode); + inode_unlock_shared(inode); out: return res; } diff --git a/fs/vboxsf/dir.c b/fs/vboxsf/dir.c index 075f15c43c78..5f1a14d5b927 100644 --- a/fs/vboxsf/dir.c +++ b/fs/vboxsf/dir.c @@ -179,9 +179,10 @@ static int vboxsf_dir_iterate(struct file *dir, struct dir_context *ctx) return 0; } +WRAP_DIR_ITER(vboxsf_dir_iterate) // FIXME! const struct file_operations vboxsf_dir_fops = { .open = vboxsf_dir_open, - .iterate = vboxsf_dir_iterate, + .iterate_shared = shared_vboxsf_dir_iterate, .release = vboxsf_dir_release, .read = generic_read_dir, .llseek = generic_file_llseek, diff --git a/include/linux/fs.h b/include/linux/fs.h index 6867512907d6..562f2623c9c9 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1780,7 +1780,6 @@ struct file_operations { ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); int (*iopoll)(struct kiocb *kiocb, struct io_comp_batch *, unsigned int flags); - int (*iterate) (struct file *, struct dir_context *); int (*iterate_shared) (struct file *, struct dir_context *); __poll_t (*poll) (struct file *, struct poll_table_struct *); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); @@ -1817,6 +1816,13 @@ struct file_operations { unsigned int poll_flags); } __randomize_layout; +/* Wrap a directory iterator that needs exclusive inode access */ +int wrap_directory_iterator(struct file *, struct dir_context *, + int (*) (struct file *, struct dir_context *)); +#define WRAP_DIR_ITER(x) \ + static int shared_##x(struct file *file , struct dir_context *ctx) \ + { return wrap_directory_iterator(file, ctx, x); } + struct inode_operations { struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *); -- 2.41.0.203.ga4f2cd32bb.dirty