> What happens if I call `get_unused_fd_flags`, and then never call > `put_unused_fd`? Assume that I don't try to use the fd in the future, > and that I just forgot to clean up after myself. Moderate amount of bogosities while the thread exists. For one thing, descriptor is leaked - for open() et.al. it will look like it's in use. For close() it will look like it's _not_ open (i.e. trying to close it from userland will quietly do nothing). Trying to overwrite it with dup2() will keep failing with -EBUSY. Kernel-side it definitely violates assertions, but currently nothing will break. Might or might not remain true in the future. Doing that again and again would lead to inflated descriptor table, but then so would dup2(0, something_large). IOW, it would be a bug, but it's probably not going to be high impact security hole. > > + execve() at the point of no return (in begin_new_exec()). execve(2), sorry. > > That's the only place where violation of that constraint on some later > > changes is plausible. That one needs to be watched out for. > Thanks for going through that. I'm in the middle of writing documentation on the descriptor table and struct file handling right now anyway... > From a Rust perspective, it sounds > easiest to just declare that execve() is an unsafe operation, and that > one of the conditions for using it is that you don't hold any fd > reservations. Trying to encode this in the fd reservation logic seems > too onerous, and I'm guessing execve is not used particularly often > anyway. Sorry, bad editing on my part - I should've clearly marked execve() as a syscall. It's not that it's an unsafe operation - it's only called from userland anyway, so it's not going to happen in scope of any reserved descriptor. The problem is different: * userland calls execve("/usr/bin/whatever", argv, envp) * that proceeds through several wrappers to do_execveat_common(). There are several syscalls in that family and they all converge to do_execveat_common() - wrappers just deal with difference in calling conventions. * do_execveat_common() set up exec context (struct linux_binprm). That opens the binary to be executed, creates memory context to be used, calculates argc, sets up argv and envp on what will become the userland stack for the new program, etc. - basically, all the work for marshalling the data from caller's memory. Then it calls bprm_execve(), which is where the rest of the work will be done. * bprm_execve() eventually calls exec_binprm(). That calls search_binary_handler(), which is where we finally get a look at the binary we are trying to load. search_binary_handler() goes through the known binary formats (ELF, script, etc.) and tries to offer the exec context to ->load_binary() of each. * ->load_binary() instance looks at the binary (starting with the first 256 bytes read for us by prepare_binprm() called in the beginning of search_binary_handler()). If it doesn't have the right magic values, ->load_binary() just returns -ENOEXEC, so that the next format would be tried. If it *does* look like something this format is going to deal with, more sanity checks are done, things are set up, etc. - details depend upon the binary format in question. See load_elf_binary() for some taste of that. Eventually it decides to actually discard the old memory and switch to new binary. Up to that point it can return an error - -ENOEXEC for soft ones ("not mine, after all, try other formats"), something like -EINVAL/-ENOMEM/-EPERM/-EIO/etc. for hard ones ("fail execve(2) with that error"). _After_ that point we have nothing to return to; old binary is not mapped anymore, userland stack frame is gone, etc. Any errors past that point are treated as "kill me". At the point of no return we call begin_new_exec(). That kills all other threads, unshares descriptor table in case it had been shared wider than the thread group, switches memory context, etc. Once begin_new_exec() is done, we can safely map whatever we want to map, handle relocations, etc. Among other things, we modify the userland register values saved on syscall entry, so that once we return from syscall we'll end up with the right register state at the right userland entry point, etc. If everything goes fine, ->load_binary() returns 0 into search_binary_handler() and we are pretty much done - some cleanups on the way out and off to the loaded binary. Alternatively, we may decide to mangle the exec context - that's what #! handling does (see load_script() - basically it opens the interpreter and massages the arguments, so that something like debian/rules build-arch turns into /usr/bin/make -f debian/rules build-arch and tells the caller to deal with that; this is what the loop in search_binary_handler() is about). There's not a lot of binary formats (5 of those currently - all in fs/binmt_*.c), but there's nothing to prohibit more of them. If somebody decides to add the infrastructure for writing those in Rust, begin_new_exec() wrapper will need to be documented as "never call that in scope of reserved descriptor". Maybe by marking that wrapper unsafe and telling the users about the restriction wrt descriptor reservations, maybe by somehow telling the compiler to watch out for that - or maybe the constraint will be gone by that time. In any case, the underlying constraint ("a thread with reserved descriptors should not try to get a private descriptor table until all those descriptors are disposed of one way or another") needs to be documented.