On Sat, Aug 03, 2024 at 11:50:54PM GMT, Al Viro wrote: > [in vfs.git#fixes] > > copy_fd_bitmaps(new, old, count) is expected to copy the first > count/BITS_PER_LONG bits from old->full_fds_bits[] and fill > the rest with zeroes. What it does is copying enough words > (BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest. > That works fine, *if* all bits past the cutoff point are > clear. Otherwise we are risking garbage from the last word > we'd copied. > > For most of the callers that is true - expand_fdtable() has > count equal to old->max_fds, so there's no open descriptors > past count, let alone fully occupied words in ->open_fds[], > which is what bits in ->full_fds_bits[] correspond to. > > The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds), > which is the smallest multiple of BITS_PER_LONG that covers all > opened descriptors below max_fds. In the common case (copying on > fork()) max_fds is ~0U, so all opened descriptors will be below > it and we are fine, by the same reasons why the call in expand_fdtable() > is safe. > > Unfortunately, there is a case where max_fds is less than that > and where we might, indeed, end up with junk in ->full_fds_bits[] - > close_range(from, to, CLOSE_RANGE_UNSHARE) with > * descriptor table being currently shared > * 'to' being above the current capacity of descriptor table > * 'from' being just under some chunk of opened descriptors. > In that case we end up with observably wrong behaviour - e.g. spawn > a child with CLONE_FILES, get all descriptors in range 0..127 open, > then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending > up with descriptor #128, despite #64 being observably not open. > > The best way to fix that is in dup_fd() - there we both can easily check > whether we might need to fix anything up and see which word needs the > upper bits cleared. > > Reproducer follows: > > #define __GNU_SOURCE > #include <linux/close_range.h> > #include <unistd.h> > #include <fcntl.h> > #include <signal.h> > #include <sched.h> > #include <stdio.h> > #include <stdbool.h> > #include <sys/mman.h> > > void is_open(int fd) > { > printf("#%d is %s\n", fd, > fcntl(fd, F_GETFD) >= 0 ? "opened" : "not opened"); > } > > int child(void *unused) > { > while(1) { > } > return 0; > } > > int main(void) > { > char *stack; > pid_t pid; > > stack = mmap(NULL, 1024*1024, PROT_READ | PROT_WRITE, > MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0); > if (stack == MAP_FAILED) { > perror("mmap"); > return -1; > } > > pid = clone(child, stack + 1024*1024, CLONE_FILES | SIGCHLD, NULL); > if (pid == -1) { > perror("clone"); > return -1; > } > for (int i = 2; i < 128; i++) > dup2(0, i); > close_range(64, ~0U, CLOSE_RANGE_UNSHARE); > > is_open(64); > printf("dup(0) => %d, expected 64\n", dup(0)); > > kill(pid, 9); > return 0; > } > > Cc: stable@xxxxxxxxxxxxxxx > Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> Hm, this should probably also get a fixes tag for 4b3b3b3b3b3b ("vfs: clear remainder of 'full_fds_bits' in dup_fd()") (and possibly for the CLOSE_RANGE_UNSHARE addition).