On Thu, May 23, 2019 at 07:22:17PM +0300, Konstantin Khlebnikov wrote:
> On 22.05.2019 18:52, Christian Brauner wrote:
> > This adds the close_range() syscall. It allows efficiently closing a range
> > of file descriptors up to all file descriptors of a calling task.
> >
> > The syscall came up in a recent discussion around the new mount API and
> > making new file descriptor types cloexec by default. During this
> > discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> > syscall in this manner has been requested by various people over time.
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> >         /* that exec is sensitive */
> >         unshare(CLONE_FILES);
> >         /* we don't want anything past stderr here */
> >         close_range(3, ~0U);
> >         execve(....);
> >
> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. daemons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> > task is multi-threaded and shares the file descriptor table with another
> > thread in which case two threads could race with one thread allocating
> > file descriptors and the other one closing them via close_range(). For the
> > general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> > file descriptor.
> > From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> >   - service managers (cf. [4])
> >   - libcs (cf. [6])
> >   - container runtimes (cf. [5])
> >   - programming language runtimes/standard libraries
> >      - Python (cf. [2])
> >      - Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
> >
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > static int close_all_fds(void)
> > {
> >         int dir_fd;
> >         DIR *dir;
> >         struct dirent *direntp;
> >
> >         dir = opendir("/proc/self/fd");
> >         if (!dir)
> >                 return -1;
> >         dir_fd = dirfd(dir);
> >         while ((direntp = readdir(dir))) {
> >                 int fd;
> >                 if (strcmp(direntp->d_name, ".") == 0)
> >                         continue;
> >                 if (strcmp(direntp->d_name, "..") == 0)
> >                         continue;
> >                 fd = atoi(direntp->d_name);
> >                 if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> >                         continue;
> >                 close(fd);
> >         }
> >         closedir(dir);
> >         return 0;
> > }
> >
> > to close_range() yields:
> > 1. closing 4 open files:
> >    - close_all_fds(): ~280 us
> >    - close_range():    ~24 us
> >
> > 2. closing 1000 open files:
> >    - close_all_fds(): ~5000 us
> >    - close_range():   ~800 us
> >
> > close_range() is designed to allow for some flexibility. Specifically, it
> > does not simply always close all open file descriptors of a task. Instead,
> > callers can specify an upper bound.
> > This is e.g.
> > useful for scenarios where specific file descriptors are
> > created with well-known numbers that are supposed to be excluded from
> > getting closed.
> > For extra paranoia close_range() comes with a flags argument. This can e.g.
> > be used to implement extensions. One can imagine userspace wanting to stop
> > at the first error instead of ignoring errors under certain circumstances.
> > There might be other valid ideas in the future. In any case, a flag
> > argument doesn't hurt and keeps us on the safe side.
>
> Here is another strange but real-life scenario: crash handler for dumping core.
>
> If an application has network connections it would be better to close them all,
> otherwise clients will wait until end of dumping process or timeout.
> Also closing normal files might be a good idea for releasing locks.
>
> But simple closing might race with other threads - closed fd will be reused
> while some code still thinks it refers to original file.
>
> Our solution closes files without freeing fd: it opens /dev/null and
> replaces all opened descriptors using dup2.
>
> So, special flag for close_range() could close files without clearing bitmap.
> Effect should be the same - fd wouldn't be reused.
>
> Actually two flags for two phases: closing files and releasing fd.

Konstantin,

I'm sorry, I totally missed that part of your mail yesterday. Without
speaking to the feasibility of this it's at least a good illustration that
people really do have the possible need for a flag argument.

Thanks!
Christian