On Fri, Aug 16, 2024 at 09:26:45AM -0700, Linus Torvalds wrote:
> On Thu, 15 Aug 2024 at 20:03, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > It *can* actually happen - all it takes is close_range(2) decision
> > to trim the copied descriptor table made before the first dup2()
> > and actual copying done after both dup2() are done.
>
> I think this is fine. It's one of those "if user threads have no
> serialization, they get what they get" situations.

As it is, unshare(CLONE_FILES) gives you a state that might be possible
if you e.g. attached a debugger to the process and poked around in
descriptor table.  CLOSE_RANGE_UNSHARE is supposed to be a shortcut
for unshare + plain close_range(), so having it end up with weird
states looks wrong.

For descriptor tables we have something very close to TSO (and possibly
the full TSO - I'll need to get some coffee and go through the barriers
we've got on the lockless side of fd_install()); this, OTOH, is not quite
Alpha-level weirdness, but it's not far from that.  And unlike Alpha we
don't have excuses along the lines of "it's cheaper that way" - it
really isn't any cheaper.

The variant I'm testing right now seems to be doing fine (LTP and about
halfway through the xfstests, with no regressions and no slowdowns) and
it's at

 fs/file.c               | 63 +++++++++++++++++--------------------------------
 include/linux/fdtable.h |  6 ++---
 kernel/fork.c           | 11 ++++-----
 3 files changed, 28 insertions(+), 52 deletions(-)

Basically,
    * switch CLOSE_RANGE_UNSHARE from unshare_fd() to dup_fd()
    * instead of "trim down to that much" pass dup_fd() an optional
      "we'll be punching a hole from <this> to <that>", which gets
      passed to sane_fdtable_size() (NULL == no hole to be punched).
    * in sane_fdtable_size()
        find last occupied bit in ->open_fds[]
        if asked to punch a hole and if that last bit is within the hole,
            find last occupied bit below the hole
        round up last occupied plus 1 to BITS_PER_LONG.

All it takes, and IMO it's simpler that way.
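
For illustration only, the sizing rule from the last item can be modelled in userspace roughly as below. This is a sketch under assumptions, not the patch itself: struct fd_range, last_set_below() and model_sane_fdtable_size() are hypothetical names, the bitmap is a bare array rather than a struct fdtable, and an empty table simply yields 0, whereas real code would presumably enforce some minimum size.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical: inclusive range of descriptors about to be closed. */
struct fd_range {
    unsigned int from, to;
};

/* Index of the highest set bit strictly below `limit`, or -1 if none. */
static long last_set_below(const unsigned long *bits, unsigned int limit)
{
    for (long i = (long)limit - 1; i >= 0; i--)
        if (bits[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            return i;
    return -1;
}

/*
 * Number of descriptor slots the copied table needs.
 * punch_hole == NULL means no hole is going to be punched.
 */
static unsigned int model_sane_fdtable_size(const unsigned long *open_fds,
                                            unsigned int max_fds,
                                            const struct fd_range *punch_hole)
{
    /* find last occupied bit in the open_fds bitmap */
    long last = last_set_below(open_fds, max_fds);

    /* if that bit falls inside the hole, look below the hole instead */
    if (punch_hole && last >= 0 &&
        (unsigned int)last >= punch_hole->from &&
        (unsigned int)last <= punch_hole->to)
        last = last_set_below(open_fds, punch_hole->from);

    /* round up (last occupied + 1) to a multiple of BITS_PER_LONG */
    unsigned int n = (unsigned int)(last + 1);
    return (n + BITS_PER_LONG - 1) / BITS_PER_LONG * BITS_PER_LONG;
}
```

With descriptors 0, 1, 2 and BITS_PER_LONG + 6 open, a hole of { 3, ~0U } sizes the copy to BITS_PER_LONG slots, while passing NULL gives 2 * BITS_PER_LONG. The point of the model is that the size is derived from one look at the bitmap at copy time, rather than from a trimming decision made earlier.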