On Tue, Apr 25, 2023 at 02:34:15PM +0200, Christian Brauner wrote: > It is rife with misunderstandings just looking at what we did in > kernel/fork.c earlier: > > retval = get_unused_fd_flags(O_RDWR | O_CLOEXEC); > [...] > pidfile = anon_inode_getfile("[pidfd]", &pidfd_fops, pid, > O_RDWR | O_CLOEXEC); > > seeing where both get_unused_fd_flags() and both *_getfile() take flag > arguments. Sure, for us this is pretty straightforward since we've seen > that code a million times. For others it's confusing why there's two > flag arguments. Sure, we could use one flags argument but it still be > weird to look at. First of all, get_unused_fd_flags() doesn't give a damn about O_RDWR and anon_inode_getfile() - about O_CLOEXEC. Duplicating the expression in places like that is cargo-culting. > But with this api we also force all users to remember that they need to > cleanup the fd and the file - but definitely one - depending on where > they fail. > > But conceptually the fd and the file belong together. After all it's the > file we're about to make that reserved fd refer to. Why? The common pattern is * reserve the descriptor * do some work that might fail * get struct file set up (which also might fail) * do more work (possibly including copyout, etc.) * install file into descriptor table We want to reserve the descriptor early, since it's about the easiest thing to undo. Why bother doing it next to file setup? That can be pretty deep in call chain and doing it that way comes with headache about passing the damn thing around. And cleanup rules with your variant end up being more complicated. Keep the manipulations of descriptor table as close to the surface as possible. The real work is with file references; descriptors belong in userland and passing them around kernel-side is asking for headache.