On Mon, Mar 29, 2021 at 11:21:34AM +0200, Christian Brauner wrote: > On Sat, Mar 27, 2021 at 11:33:37PM +0000, Al Viro wrote: > > On Fri, Mar 26, 2021 at 02:50:11PM +0100, Christian Brauner wrote: > > > @@ -632,6 +632,7 @@ EXPORT_SYMBOL(close_fd); /* for ksys_close() */ > > > static inline void __range_cloexec(struct files_struct *cur_fds, > > > unsigned int fd, unsigned int max_fd) > > > { > > > + unsigned int cur_max; > > > struct fdtable *fdt; > > > > > > if (fd > max_fd) > > > @@ -639,7 +640,12 @@ static inline void __range_cloexec(struct files_struct *cur_fds, > > > > > > spin_lock(&cur_fds->file_lock); > > > fdt = files_fdtable(cur_fds); > > > - bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1); > > > + /* make very sure we're using the correct maximum value */ > > > + cur_max = fdt->max_fds; > > > + cur_max--; > > > + cur_max = min(max_fd, cur_max); > > > + if (fd <= cur_max) > > > + bitmap_set(fdt->close_on_exec, fd, cur_max - fd + 1); > > > spin_unlock(&cur_fds->file_lock); > > > } > > > > Umm... That's harder to follow than it ought to be. What's the point of > > having > > max_fd = min(max_fd, cur_max); > > done in the caller, anyway? Note that in __range_close() you have to > > compare with re-fetched ->max_fds (look at pick_file()), so... > > Yeah, I'll massage that patch a bit. I wanted to know whether this fixes > the issue first though. > > > > > BTW, I really wonder if the cost of jerking ->file_lock up and down > > in that loop in __range_close() is negligible. What values do we > > Just for the record, I remember you pointing at that originally. Linus > argued that this likely wasn't going to be a problem and that if people > see performance hits we'll optimize. > > > typically get from callers and how sparse does descriptor table tend > > to be for those? > > Weirdly, I can actually somewhat answer that question since I tend to > regularly "survey" large userspace projects I know or am involved in > that adopt new APIs we added just to see how they use it. > > A few users: > 1. crun > https://github.com/containers/crun/blob/a1c0ef1b886ca30c2fb0906c7c43be04b555c52c/src/libcrun/utils.c#L1490 > ret = syscall_close_range (n, UINT_MAX, CLOSE_RANGE_CLOEXEC); > > 2. LXD > https://github.com/lxc/lxd/blob/f12f03a4ba4645892ef6cc167c24da49d1217b02/lxd/main_forkexec.go#L293 > ret = close_range(EXEC_PIPE_FD + 1, UINT_MAX, CLOSE_RANGE_UNSHARE); > > 3. LXC > https://github.com/lxc/lxc/blob/1718e6d6018d5d6072a01d92a11d5aafc314f98f/src/lxc/rexec.c#L165 > ret = close_range(STDERR_FILENO + 1, MAX_FILENO, CLOSE_RANGE_CLOEXEC); > > Of these three 1. and 3. don't matter because they rely on > CLOSE_RANGE_CLOEXEC and exec. > For 2. I can say that the fdtable is likely going to be sparse. > close_range() here is basically used to prevent accidental fd leaks > across an exec. So 2. should never have more > 4 file. In fact, this > could and should probably be switched to CLOSE_RANGE_CLOEXEC too. > > The next two cases might be more interesting: > > 4. systemd > - https://github.com/systemd/systemd/blob/fe96c0f86d15e844d74d539c6cff7f971078cf84/src/basic/fd-util.c#L228 > close_range(3, -1, 0) > - https://github.com/systemd/systemd/blob/fe96c0f86d15e844d74d539c6cff7f971078cf84/src/basic/fd-util.c#L271 > https://github.com/systemd/systemd/blob/fe96c0f86d15e844d74d539c6cff7f971078cf84/src/basic/fd-util.c#L288 > /* Close everything between the start and end fds (both of which shall stay open) */ > if (close_range(start + 1, end - 1, 0) < 0) { > if (close_range(sorted[n_sorted-1] + 1, -1, 0) >= 0) > > 5. Python > https://github.com/python/cpython/blob/9976834f807ea63ca51bc4f89be457d734148682/Python/fileutils.c#L2250 > > systemd has the regular case that others have too where it simply closes > all fds over 3 and it also has the more complicated case where it has an > ordered array of fds closing up to the lower bound and after the upper > bound up to the maximum. PID 1 can have a large number of fds open > because of socket activation so here close_range() will encounter less > sparse fd tables where it needs to close a lot of fds. > > For Python's os.closerange() implementation which depends on our syscall > it's harder to say given that this will be used by a lot of projects but > I would _guess_ that if people use closerange() they do so because they > actually have something to close. > > In short, I would think that close_range() without the > CLOSE_RANGE_CLOEXEC feature will usually be used in scenarios where > there's work to be done, i.e. where the caller likely knows that they > might inherit a non-trivial number of file descriptors (usually after a > fork) that they want to close and they want to do it either because they > don't exec or they don't know when they'll exec. All others I'd expect > to switch to CLOSE_RANGE_CLOEXEC on kernels where it's supported. The patch below is a bit noiser then I'd like but it rips out the double logic where we check min() twice and should also tweak the worst-case where we keep taking the spinlock for a few fds that are already past fdt->max_fds. Afaict, this should be a very rare (pathological?) case. Even before this change the logic would've capped max_fds to fdt->max_fds but wouldn't have bothered to recheck within the pick_file() loop. I've tweaked it to do that and then to return as soon as we see that we're past the last fd. If you think that's worth then I'm happy to write a commit message. I'm not sure if it's worth doing something more fancy than this tbh but curious to hear what you think (Only compile tested for now.): --- fs/file.c | 72 +++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 49 insertions(+), 23 deletions(-) diff --git a/fs/file.c b/fs/file.c index f3a4bac2cbe9..5027cd75ec59 100644 --- a/fs/file.c +++ b/fs/file.c @@ -596,18 +596,32 @@ void fd_install(unsigned int fd, struct file *file) EXPORT_SYMBOL(fd_install); +/** + * pick_file - return file associatd with fd + * @files: file struct to retrieve file from + * @fd: file descriptor to retrieve file for + * + * If this functions returns an EINVAL error pointer the fd was beyond the + * current maximum number of file descriptors for that fdtable. + * + * Returns: The file associated with @fd, on error returns an error pointer. + */ static struct file *pick_file(struct files_struct *files, unsigned fd) { - struct file *file = NULL; + struct file *file; struct fdtable *fdt; spin_lock(&files->file_lock); fdt = files_fdtable(files); - if (fd >= fdt->max_fds) + if (fd >= fdt->max_fds) { + file = ERR_PTR(-EINVAL); goto out_unlock; + } file = fdt->fd[fd]; - if (!file) + if (!file) { + file = ERR_PTR(-EBADF); goto out_unlock; + } rcu_assign_pointer(fdt->fd[fd], NULL); __put_unused_fd(files, fd); @@ -622,24 +636,37 @@ int close_fd(unsigned fd) struct file *file; file = pick_file(files, fd); - if (!file) + if (IS_ERR(file)) return -EBADF; return filp_close(file, files); } EXPORT_SYMBOL(close_fd); /* for ksys_close() */ +/** + * last_fd - return last valid index into fd table + * @cur_fds: files struct + * + * Context: Either rcu read lock or files_lock must be held. + * + * Returns: Last valid index into fdtable. + */ +static inline unsigned last_fd(struct fdtable *fdt) +{ + return fdt->max_fds - 1; +} + static inline void __range_cloexec(struct files_struct *cur_fds, unsigned int fd, unsigned int max_fd) { struct fdtable *fdt; - if (fd > max_fd) - return; - + /* make sure we're using the correct maximum value */ spin_lock(&cur_fds->file_lock); fdt = files_fdtable(cur_fds); - bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1); + max_fd = min(last_fd(fdt), max_fd); + if (fd <= max_fd) + bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1); spin_unlock(&cur_fds->file_lock); } @@ -650,11 +677,16 @@ static inline void __range_close(struct files_struct *cur_fds, unsigned int fd, struct file *file; file = pick_file(cur_fds, fd++); - if (!file) + if (!IS_ERR(file)) { + /* found a valid file to close */ + filp_close(file, cur_fds); + cond_resched(); continue; + } - filp_close(file, cur_fds); - cond_resched(); + /* beyond the last fd in that table */ + if (PTR_ERR(file) == -EINVAL) + return; } } @@ -669,7 +701,6 @@ static inline void __range_close(struct files_struct *cur_fds, unsigned int fd, */ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags) { - unsigned int cur_max; struct task_struct *me = current; struct files_struct *cur_fds = me->files, *fds = NULL; @@ -679,13 +710,6 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags) if (fd > max_fd) return -EINVAL; - rcu_read_lock(); - cur_max = files_fdtable(cur_fds)->max_fds; - rcu_read_unlock(); - - /* cap to last valid index into fdtable */ - cur_max--; - if (flags & CLOSE_RANGE_UNSHARE) { int ret; unsigned int max_unshare_fds = NR_OPEN_MAX; @@ -697,8 +721,12 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags) * If the caller requested all fds to be made cloexec copy all * of the file descriptors since they still want to use them. */ - if (!(flags & CLOSE_RANGE_CLOEXEC) && (max_fd >= cur_max)) - max_unshare_fds = fd; + if (!(flags & CLOSE_RANGE_CLOEXEC)) { + rcu_read_lock(); + if (max_fd >= last_fd(files_fdtable(cur_fds))) + max_unshare_fds = fd; + rcu_read_unlock(); + } ret = unshare_fd(CLONE_FILES, max_unshare_fds, &fds); if (ret) @@ -712,8 +740,6 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags) swap(cur_fds, fds); } - max_fd = min(max_fd, cur_max); - if (flags & CLOSE_RANGE_CLOEXEC) __range_cloexec(cur_fds, fd, max_fd); else -- 2.27.0