Re: [PATCH v10 6/8] rust: file: add `FileDescriptorReservation`

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Sun, 15 Sep 2024 23:01:26 +0100

> What happens if I call `get_unused_fd_flags`, and then never call
> `put_unused_fd`? Assume that I don't try to use the fd in the future,
> and that I just forgot to clean up after myself.

Moderate amount of bogosities while the thread exists.  For one thing,
descriptor is leaked - for open() et.al. it will look like it's in use.
For close() it will look like it's _not_ open (i.e. trying to close it
from userland will quietly do nothing).  Trying to overwrite it with
dup2() will keep failing with -EBUSY.

Kernel-side it definitely violates assertions, but currently nothing
will break.  Might or might not remain true in the future.  Doing that
again and again would lead to inflated descriptor table, but then
so would dup2(0, something_large).

IOW, it would be a bug, but it's probably not going to be high impact
security hole.

> >         + execve() at the point of no return (in begin_new_exec()).
	      execve(2), sorry.
> > That's the only place where violation of that constraint on some later
> > changes is plausible.  That one needs to be watched out for.

> Thanks for going through that.

I'm in the middle of writing documentation on the descriptor table and
struct file handling right now anyway...

> From a Rust perspective, it sounds
> easiest to just declare that execve() is an unsafe operation, and that
> one of the conditions for using it is that you don't hold any fd
> reservations. Trying to encode this in the fd reservation logic seems
> too onerous, and I'm guessing execve is not used particularly often
> anyway.

Sorry, bad editing on my part - I should've clearly marked execve() as a
syscall.  It's not that it's an unsafe operation - it's only called from
userland anyway, so it's not going to happen in scope of any reserved
descriptor.

The problem is different:

* userland calls execve("/usr/bin/whatever", argv, envp)

* that proceeds through several wrappers to do_execveat_common().
  There are several syscalls in that family and they all converge
  to do_execveat_common() - wrappers just deal with difference
  in calling conventions.

* do_execveat_common() set up exec context (struct linux_binprm).
  That opens the binary to be executed, creates memory context
  to be used, calculates argc, sets up argv and envp on what will
  become the userland stack for the new program, etc. - basically,
  all the work for marshalling the data from caller's memory.
  Then it calls bprm_execve(), which is where the rest of the work
  will be done.

* bprm_execve() eventually calls exec_binprm().  That calls
  search_binary_handler(), which is where we finally get a look
  at the binary we are trying to load.  search_binary_handler()
  goes through the known binary formats (ELF, script, etc.)
  and tries to offer the exec context to ->load_binary() of
  each.

* ->load_binary() instance looks at the binary (starting with the
  first 256 bytes read for us by prepare_binprm() called in the
  beginning of search_binary_handler()).  If it doesn't have the
  right magic values, ->load_binary() just returns -ENOEXEC,
  so that the next format would be tried.  If it *does* look like
  something this format is going to deal with, more sanity checks
  are done, things are set up, etc. - details depend upon the
  binary format in question.  See load_elf_binary() for some taste
  of that.  Eventually it decides to actually discard the old
  memory and switch to new binary.  Up to that point it can
  return an error - -ENOEXEC for soft ones ("not mine, after all,
  try other formats"), something like -EINVAL/-ENOMEM/-EPERM/-EIO/etc.
  for hard ones ("fail execve(2) with that error").  _After_ that
  point we have nothing to return to; old binary is not mapped
  anymore, userland stack frame is gone, etc.  Any errors past
  that point are treated as "kill me".

  At the point of no return we call begin_new_exec().  That
  kills all other threads, unshares descriptor table in case it
  had been shared wider than the thread group, switches memory
  context, etc.

  Once begin_new_exec() is done, we can safely map whatever we
  want to map, handle relocations, etc.  Among other things,
  we modify the userland register values saved on syscall entry,
  so that once we return from syscall we'll end up with the
  right register state at the right userland entry point, etc.
  If everything goes fine, ->load_binary() returns 0 into
  search_binary_handler() and we are pretty much done - some
  cleanups on the way out and off to the loaded binary.

  Alternatively, we may decide to mangle the exec context -
  that's what #! handling does (see load_script() - basically
  it opens the interpreter and massages the arguments, so
  that something like debian/rules build-arch turns into
  /usr/bin/make -f debian/rules build-arch and tells the
  caller to deal with that; this is what the loop in
  search_binary_handler() is about).

There's not a lot of binary formats (5 of those currently -
all in fs/binmt_*.c), but there's nothing to prohibit more
of them.  If somebody decides to add the infrastructure for
writing those in Rust, begin_new_exec() wrapper will need
to be documented as "never call that in scope of reserved
descriptor".  Maybe by marking that wrapper unsafe and
telling the users about the restriction wrt descriptor
reservations, maybe by somehow telling the compiler to
watch out for that - or maybe the constraint will be gone
by that time.

In any case, the underlying constraint ("a thread with
reserved descriptors should not try to get a private
descriptor table until all those descriptors are disposed
of one way or another") needs to be documented.