Re: [RFC] more close_range() fun

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Fri, 16 Aug 2024 18:19:25 +0100

On Fri, Aug 16, 2024 at 09:26:45AM -0700, Linus Torvalds wrote:
> On Thu, 15 Aug 2024 at 20:03, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > It *can* actually happen - all it takes is close_range(2) decision
> > to trim the copied descriptor table made before the first dup2()
> > and actual copying done after both dup2() are done.
> 
> I think this is fine. It's one of those "if user threads have no
> serialization, they get what they get" situations.

As it is, unshare(CLONE_FILES) gives you a state that might be possible
if you e.g. attached a debugger to the process and poked around in the
descriptor table.  CLOSE_RANGE_UNSHARE is supposed to be a shortcut
for unshare + plain close_range(), so having it end up with weird
states looks wrong.

For descriptor tables we have something very close to TSO (and possibly
the full TSO - I'll need to get some coffee and go through the barriers
we've got on the lockless side of fd_install()); this, OTOH, is not
quite Alpha-level weirdness, but it's not far from that.  And unlike
Alpha we don't have excuses along the lines of "it's cheaper that way" -
it really isn't any cheaper.

The variant I'm testing right now seems to be doing fine (LTP and about
halfway through the xfstests, with no regressions and no slowdowns)
and it's at
 fs/file.c               | 63 +++++++++++++++++--------------------------------
 include/linux/fdtable.h |  6 ++---
 kernel/fork.c           | 11 ++++-----
 3 files changed, 28 insertions(+), 52 deletions(-)

Basically,
	* switch CLOSE_RANGE_UNSHARE from unshare_fd() to dup_fd()
	* instead of "trim down to that much" pass dup_fd() an
	  optional "we'll be punching a hole from <this> to <that>", which
	  gets passed to sane_fdtable_size() (NULL == no hole to be punched).
	* in sane_fdtable_size()
		find last occupied bit in ->open_fds[]
		if asked to punch a hole and if that last bit is within
		the hole, find last occupied bit below the hole
		round up (last occupied + 1) to a multiple of BITS_PER_LONG.
All it takes, and IMO it's simpler that way.
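
To make the sane_fdtable_size() steps above concrete, here is a rough
userspace sketch, not the actual patch.  struct fd_hole, last_set_bit()
and sketch_fdtable_size() are names made up for illustration; the hole is
assumed to be an inclusive [from, to] range, and an empty result is assumed
to fall back to one word's worth of slots.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical: inclusive range of descriptors about to be closed. */
struct fd_hole {
	unsigned int from, to;
};

/* Index of the highest set bit below nbits, or nbits if none is set. */
static unsigned int last_set_bit(const unsigned long *map, unsigned int nbits)
{
	unsigned int i = nbits;

	while (i-- > 0)
		if (map[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
			return i;
	return nbits;
}

/*
 * How many slots the copied table needs, given the open-descriptor
 * bitmap and an optional hole (NULL == no hole to be punched).
 */
static unsigned int sketch_fdtable_size(const unsigned long *open_fds,
					unsigned int max_fds,
					const struct fd_hole *hole)
{
	/* find last occupied bit */
	unsigned int last = last_set_bit(open_fds, max_fds);

	if (last == max_fds)
		return BITS_PER_LONG;	/* nothing open (assumed minimum) */

	/* last bit is within the hole: look below the hole instead */
	if (hole && hole->from <= last && last <= hole->to) {
		last = last_set_bit(open_fds, hole->from);
		if (last == hole->from)
			return BITS_PER_LONG;	/* nothing below the hole */
	}

	/* round up (last + 1) to a multiple of BITS_PER_LONG */
	return (last + BITS_PER_LONG) / BITS_PER_LONG * BITS_PER_LONG;
}
```

With descriptors 0..2 and one more in the second word open, no hole gives
two words' worth of slots, while a hole starting at 3 and covering the rest
gives one.  The point is that the size is decided from a single look at
the bitmap, hole included, rather than trimming after the fact.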