On Sun, Sep 15, 2024 at 07:39:05PM +0100, Al Viro wrote: > 2) calling thread MUST NOT unshare descriptor table while it has > any reserved descriptors. I.e. > fd = get_unused_fd(); > unshare_files(); > fd_install(fd, file); > is a bug. Reservations are discarded by that. Getting rid of that > constraint would require tracking the sets of reserved descriptors > separately for each thread that happens to share the descriptor table. > Conceptually they *are* per-thread - the same thread that has done > reservation must either discard it or use it. However, it's easier to > keep the "it's reserved by some thread" represented in descriptor table > itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is > NULL) than try and keep track of who's reserved what. The constraint is > basically "all reservations can stay with the old copy", i.e. "caller has > no reservations of its own to transfer into the new private copy it gets". FWIW, I toyed with the idea of having reservations kept per-thread; it is possible and it simplifies some things, but I hadn't been able to find a way to do that without buggering syscall latency for open() et.al. It would keep track of the set of reservations in task_struct (count, two-element array for the first two + page pointer for spillovers, for the rare threads that need more than two reserved simultaneously). Representation in fdtable: state open_fds bit value in ->fd[] array free clear 0UL reserved set 0UL uncommitted set 1UL|(unsigned long)file open set (unsigned long)file with file lookup treating any odd value as 0 (i.e. as reserved) fd_install() switching reserved to uncommitted *AND* separate "commit" operation that does this: if current->reservation_count == 0 return if failure for each descriptor in our reserved set v = fdtable->fd[descriptor] if (v) { fdtable->fd[descriptor] = 0; fput((struct file *)(v & ~1); } clear bit in fdtable->open_fds[] else for each descriptor in our reserved set v = fdtable->fd[descriptor] if (v) fdtable->fd[descriptor] = v & ~1; else BUG current->reservation_count = 0 That "commit" thing would be called on return from syscall for userland threads and would be called explicitly in kernel threads that have a descriptor table and work with it. The benefit would be that fd_install() would *NOT* have to be done after the last possible failure exit - something that installs a lot of files wouldn't have to play convoluted games on cleanup. Simply returning an error would do the right thing. Two stores into ->fd[] instead of one is not a big deal; however, anything like task_work_add() to arrange the call of "commit" ends up being bloody awful. We could have it called from syscall glue directly, but that means touching assembler for quite a few architectures, and I hadn't gotten around to that. Can be done, but it's not a pleasant work...