On Sun, Sep 15, 2024 at 8:39 PM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > On Sun, Sep 15, 2024 at 02:31:32PM +0000, Alice Ryhl wrote: > > > +impl Drop for FileDescriptorReservation { > > + fn drop(&mut self) { > > + // SAFETY: By the type invariants of this type, `self.fd` was previously returned by > > + // `get_unused_fd_flags`. We have not yet used the fd, so it is still valid, and `current` > > + // still refers to the same task, as this type cannot be moved across task boundaries. > > + unsafe { bindings::put_unused_fd(self.fd) }; > > + } > > +} > > FWIW, it's a bit more delicate. The real rules for API users are > > 1) anything you get from get_unused_fd_flags() (well, alloc_fd(), > internally) must be passed either to put_unused_fd() or fd_install() before > you return from syscall. That should be done by the same thread and > all calls of put_unused_fd() or fd_install() should be paired with > some get_unused_fd_flags() in that manner (i.e. by the same thread, > within the same syscall, etc.) Ok, we have to use it before returning from the syscall. That's fine. What happens if I call `get_unused_fd_flags`, and then never call `put_unused_fd`? Assume that I don't try to use the fd in the future, and that I just forgot to clean up after myself. > 2) calling thread MUST NOT unshare descriptor table while it has > any reserved descriptors. I.e. > fd = get_unused_fd(); > unshare_files(); > fd_install(fd, file); > is a bug. Reservations are discarded by that. Getting rid of that > constraint would require tracking the sets of reserved descriptors > separately for each thread that happens to share the descriptor table. > Conceptually they *are* per-thread - the same thread that has done > reservation must either discard it or use it. However, it's easier to > keep the "it's reserved by some thread" represented in descriptor table > itself (bit set in ->open_fds bitmap, file reference in ->fd[] array is > NULL) than try and keep track of who's reserved what. The constraint is > basically "all reservations can stay with the old copy", i.e. "caller has > no reservations of its own to transfer into the new private copy it gets". > It's not particularly onerous[*] and it simplifies things > quite a bit. However, if we are documenting thing, it needs to be > put explicitly. With respect to Rust, if you do e.g. binfmt-in-rust > support it will immediately become an issue - begin_new_exec() is calling > unshare_files(), so the example above can become an issue. > > Internally (in fs/file.c, that is) we have additional safety > rule - anything that might be given an arbitrary descriptor (e.g. > do_dup2() destination can come directly from dup2(2) argument, > file_close_fd_locked() victim can come directly from close(2) one, > etc.) must leave reserved descriptors alone. Not an issue API users > need to watch out for, though. > > [*] unsharing the descriptor table is done by > + close_range(2), which has no reason to allocate any descriptors > and is only called by userland. > + unshare(2), which has no reason to allocate any descriptors > and is only called by userland. > + a place in early init that call ksys_unshare() while arranging > the environment for /linuxrc from initrd image to be run. Again, no > reserved descriptors there. > + coredumping thread in the beginning of do_coredump(). > The caller is at the point of signal delivery, which means that it had > already left whatever syscall it might have been in. Which means > that all reservations must have been undone by that point. > + execve() at the point of no return (in begin_new_exec()). > That's the only place where violation of that constraint on some later > changes is plausible. That one needs to be watched out for. Thanks for going through that. From a Rust perspective, it sounds easiest to just declare that execve() is an unsafe operation, and that one of the conditions for using it is that you don't hold any fd reservations. Trying to encode this in the fd reservation logic seems too onerous, and I'm guessing execve is not used particularly often anyway. Alice