[in vfs.git#fixes] copy_fd_bitmaps(new, old, count) is expected to copy the first count/BITS_PER_LONG bits from old->full_fds_bits[] and fill the rest with zeroes. What it does is copying enough words (BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest. That works fine, *if* all bits past the cutoff point are clear. Otherwise we are risking garbage from the last word we'd copied. For most of the callers that is true - expand_fdtable() has count equal to old->max_fds, so there's no open descriptors past count, let alone fully occupied words in ->open_fds[], which is what bits in ->full_fds_bits[] correspond to. The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds), which is the smallest multiple of BITS_PER_LONG that covers all opened descriptors below max_fds. In the common case (copying on fork()) max_fds is ~0U, so all opened descriptors will be below it and we are fine, by the same reasons why the call in expand_fdtable() is safe. Unfortunately, there is a case where max_fds is less than that and where we might, indeed, end up with junk in ->full_fds_bits[] - close_range(from, to, CLOSE_RANGE_UNSHARE) with * descriptor table being currently shared * 'to' being above the current capacity of descriptor table * 'from' being just under some chunk of opened descriptors. In that case we end up with observably wrong behaviour - e.g. spawn a child with CLONE_FILES, get all descriptors in range 0..127 open, then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending up with descriptor #128, despite #64 being observably not open. The best way to fix that is in dup_fd() - there we both can easily check whether we might need to fix anything up and see which word needs the upper bits cleared. Reproducer follows: #define __GNU_SOURCE #include <linux/close_range.h> #include <unistd.h> #include <fcntl.h> #include <signal.h> #include <sched.h> #include <stdio.h> #include <stdbool.h> #include <sys/mman.h> void is_open(int fd) { printf("#%d is %s\n", fd, fcntl(fd, F_GETFD) >= 0 ? "opened" : "not opened"); } int child(void *unused) { while(1) { } return 0; } int main(void) { char *stack; pid_t pid; stack = mmap(NULL, 1024*1024, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0); if (stack == MAP_FAILED) { perror("mmap"); return -1; } pid = clone(child, stack + 1024*1024, CLONE_FILES | SIGCHLD, NULL); if (pid == -1) { perror("clone"); return -1; } for (int i = 2; i < 128; i++) dup2(0, i); close_range(64, ~0U, CLOSE_RANGE_UNSHARE); is_open(64); printf("dup(0) => %d, expected 64\n", dup(0)); kill(pid, 9); return 0; } Cc: stable@xxxxxxxxxxxxxxx Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> --- diff --git a/fs/file.c b/fs/file.c index a11e59b5d602..7f0ab8636d7c 100644 --- a/fs/file.c +++ b/fs/file.c @@ -380,6 +380,20 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int } copy_fd_bitmaps(new_fdt, old_fdt, open_files); + if (unlikely(max_fds != NR_OPEN_MAX)) { + /* + * if there had been opened descriptors past open_files, + * the last copied word in full_fds_bits might have picked + * stray bits; nothing of that sort happens to open_fds and + * close_on_exec, since there the part that needs to be copied + * will always fall on a word boundary. + */ + unsigned int n = open_files / BITS_PER_LONG; + if (n % BITS_PER_LONG) { + unsigned long mask = BITMAP_LAST_WORD_MASK(n); + new_fdt->full_fds_bits[n / BITS_PER_LONG] &= mask; + } + } old_fds = old_fdt->fd; new_fds = new_fdt->fd;