Experimental series trying to sanitize the handling of struct fd. Lightly tested, in serious need of review. It's 6.10-rc1-based, lives in git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.fd Individual patches in followups, descriptions below. Shortlog: Al Viro (19): powerpc: fix a file leak in kvm_vcpu_ioctl_enable_cap() lirc: rc_dev_get_from_fd(): fix file leak introduce fd_file(), convert all accessors to it. struct fd: representation change add struct fd constructors, get rid of __to_fd() net/socket.c: switch to CLASS(fd) introduce struct fderr, convert overlayfs uses to that fdget_raw() users: switch to CLASS(fd_raw, ...) css_set_fork(): switch to CLASS(fd_raw, ...) introduce "fd_pos" class switch simple users of fdget() to CLASS(fd, ...) bpf: switch to CLASS(fd, ...) convert vmsplice() to CLASS(fd, ...) finit_module(): convert to CLASS(fd, ...) timerfd: switch to CLASS(fd, ...) do_mq_notify(): switch to CLASS(fd, ...) simplify xfs_find_handle() a bit convert kernel/events/core.c deal with the last remaing boolean uses of fd_file() Diffstat: arch/alpha/kernel/osf_sys.c | 7 +- arch/arm/kernel/sys_oabi-compat.c | 18 +- arch/powerpc/kvm/book3s_64_vio.c | 9 +- arch/powerpc/kvm/powerpc.c | 26 +-- arch/powerpc/platforms/cell/spu_syscalls.c | 17 +- arch/x86/kernel/cpu/sgx/main.c | 10 +- arch/x86/kvm/svm/sev.c | 43 ++-- drivers/gpu/drm/amd/amdgpu/amdgpu_sched.c | 27 +-- drivers/gpu/drm/drm_syncobj.c | 11 +- drivers/infiniband/core/ucma.c | 21 +- drivers/infiniband/core/uverbs_cmd.c | 12 +- drivers/media/mc/mc-request.c | 22 +- drivers/media/rc/lirc_dev.c | 13 +- drivers/vfio/group.c | 10 +- drivers/vfio/virqfd.c | 20 +- drivers/virt/acrn/irqfd.c | 14 +- drivers/xen/privcmd.c | 35 ++-- fs/btrfs/ioctl.c | 7 +- fs/coda/inode.c | 13 +- fs/eventfd.c | 9 +- fs/eventpoll.c | 62 ++---- fs/ext4/ioctl.c | 23 +- fs/f2fs/file.c | 17 +- fs/fcntl.c | 74 +++---- fs/fhandle.c | 7 +- fs/file.c | 26 +-- fs/fsopen.c | 23 +- fs/fuse/dev.c | 10 +- fs/ioctl.c | 47 ++--- fs/kernel_read_file.c | 12 +- fs/locks.c | 27 +-- fs/namei.c | 19 +- fs/namespace.c | 53 ++--- fs/notify/fanotify/fanotify_user.c | 50 ++--- fs/notify/inotify/inotify_user.c | 44 ++-- fs/ocfs2/cluster/heartbeat.c | 17 +- fs/open.c | 67 +++--- fs/overlayfs/file.c | 187 +++++++---------- fs/quota/quota.c | 18 +- fs/read_write.c | 227 +++++++++----------- fs/readdir.c | 38 ++-- fs/remap_range.c | 11 +- fs/select.c | 17 +- fs/signalfd.c | 11 +- fs/smb/client/ioctl.c | 17 +- fs/splice.c | 82 +++----- fs/stat.c | 10 +- fs/statfs.c | 12 +- fs/sync.c | 33 ++- fs/timerfd.c | 42 ++-- fs/utimes.c | 11 +- fs/xattr.c | 64 +++--- fs/xfs/xfs_exchrange.c | 12 +- fs/xfs/xfs_handle.c | 16 +- fs/xfs/xfs_ioctl.c | 85 +++----- include/linux/bpf.h | 9 +- include/linux/cleanup.h | 2 +- include/linux/file.h | 89 +++++--- io_uring/sqpoll.c | 31 +-- ipc/mqueue.c | 126 +++++------ kernel/bpf/bpf_inode_storage.c | 29 +-- kernel/bpf/btf.c | 13 +- kernel/bpf/map_in_map.c | 37 +--- kernel/bpf/syscall.c | 197 ++++++----------- kernel/bpf/token.c | 19 +- kernel/bpf/verifier.c | 20 +- kernel/cgroup/cgroup.c | 21 +- kernel/events/core.c | 67 +++--- kernel/module/main.c | 15 +- kernel/nsproxy.c | 15 +- kernel/pid.c | 26 +-- kernel/signal.c | 33 ++- kernel/sys.c | 21 +- kernel/taskstats.c | 20 +- kernel/watch_queue.c | 8 +- mm/fadvise.c | 10 +- mm/filemap.c | 19 +- mm/memcontrol.c | 37 ++-- mm/readahead.c | 25 +-- net/core/net_namespace.c | 14 +- net/core/sock_map.c | 23 +- net/socket.c | 325 +++++++++++++---------------- security/integrity/ima/ima_main.c | 9 +- security/landlock/syscalls.c | 57 ++--- security/loadpin/loadpin.c | 10 +- sound/core/pcm_native.c | 6 +- virt/kvm/eventfd.c | 19 +- virt/kvm/vfio.c | 18 +- 88 files changed, 1234 insertions(+), 1951 deletions(-) 01/19) powerpc: fix a file leak in kvm_vcpu_ioctl_enable_cap() 02/19) lirc: rc_dev_get_from_fd(): fix file leak First two patches are obvious leak fixes - missing fdput() on one a failure exits. 03/19) introduce fd_file(), convert all accessors to it. For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NB: this commit is where I'd expect arseloads of conflicts through the cycle, simply because of the breadth of area being touched. The biggest one, as well (500 lines modified). Might be worth splitting - not sure. 04/19) struct fd: representation change The absolute majority of instances comes from fdget() and its relatives; the underlying primitives actually return a struct file reference and a couple of flags encoded into an unsigned long - the lower two bits of file address are always zero, so we can stash the flags into those. On the way out we use __to_fd() to unpack that unsigned long into struct fd. Let's use that representation for struct fd itself - make it a structure with a single unsigned long member (.word), with the value equal either to (unsigned long)p | flags, p being an address of some struct file instance, or to 0 for an empty fd. Note that we never used a struct fd instance with NULL ->file and non-zero ->flags; the emptiness had been checked as (!f.file) and we expected e.g. fdput(empty) to be a no-op. With new representation we can use (!f.word) for emptiness check; that is enough for compiler to figure out that (f.word & FDPUT_FPUT) will be false and that fdput(f) will be a no-op in such case. For now the new predicate (fd_empty(f)) has no users; all the existing checks have form (!fd_file(f)). We will convert to fd_empty() use later; here we only define it (and tell the compiler that it's unlikely to return true). This commit only deals with representation change; there will be followups. NOTE: overlayfs part is _not_ in the final form - it will be massaged shortly. 05/19) add struct fd constructors, get rid of __to_fd() Make __fdget() et.al. return struct fd directly. New helpers: BORROWED_FD(file) and CLONED_FD(file), for borrowed and cloned file references resp. NOTE: this might need tuning; in particular, inline on __fget_light() is there to keep the code generation same as before - we probably want to keep it inlined in fdget() et.al. (especially so in fdget_pos()), but that needs profiling. Next two commits deal with the worst irregularities in struct fd use: in net/socket.c we have fdget() without matching fdput() - fdget() is done in sockfd_lookup_light(), then the results are passed (in modified form) to caller, which deals with conditional fput(). And in overlayfs we have an almost-but-not-quite struct fd shoehorned into struct fd, with ugly calling conventions as the result of that. I'm not sure what order would be better for these two commits. 06/19) net/socket.c: switch to CLASS(fd) I strongly suspect that important part in sockfd_lookup_light() is avoiding needless file refcount operations, not the marginal reduction of the register pressure from not keeping a struct file pointer in the caller. If that's true, we should get the same benefits from straight fdget()/fdput(). And AFAICS with sane use of CLASS(fd) we can get a better code generation... Would be nice if somebody tested it on networking test suites (including benchmarks)... sockfd_lookup_light() does fdget(), uses sock_from_file() to get the associated socket and returns the struct socket reference to the caller, along with "do we need to fput()" flag. No matching fdput(), the caller does its equivalent manually, using the fact that sock->file points to the struct file the socket has come from. Get rid of that - have the callers do fdget()/fdput() and use sock_from_file() directly. That kills sockfd_lookup_light() and fput_light() (no users left). What's more, we can get rid of explicit fdget()/fdput() by switching to CLASS(fd, ...) - code generation does not suffer, since now fdput() inserted on "descriptor is not opened" failure exit is recognized to be a no-op by compiler. We could split that commit in two (getting rid of sockd_lookup_light() and switch to CLASS(fd, ...)), but AFAICS it ends up being harder to read that way. 07/19) introduce struct fderr, convert overlayfs uses to that Similar to struct fd; unlike struct fd, it can represent error values. Accessors: * fd_empty(f): true if f represents an error * fd_file(f): just as for struct fd it yields a pointer to struct file if fd_empty(f) is false. If fd_empty(f) is true, fd_file(f) is guaranteed _not_ to be an address of any object (IS_ERR() will be true in that case) * fd_error(f): if f represents an error, returns that error, otherwise the return value is junk. Constructors: * ERR_FD(-E...): an instance encoding given error [ERR_FDERR, perhaps?] * BORROWED_FDERR(file): if file points to a struct file instance, return a struct fderr representing that file reference with no flags set. if file is an ERR_PTR(-E...), return a struct fderr representing that error. file MUST NOT be NULL. * CLONED_FDERR(file): similar, but in case when file points to a struct file instance, set FDPUT_FPUT in flags. fdput_err() serves as a destructor. See fs/overlayfs/file.c for example of use. 08/19) fdget_raw() users: switch to CLASS(fd_raw, ...) all convert trivially 09/19) css_set_fork(): switch to CLASS(fd_raw, ...) reference acquired there by fget_raw() is not stashed anywhere - we could as well borrow instead. 10/19) introduce "fd_pos" class fdget_pos() for constructor, fdput_pos() for cleanup, users of fd..._pos() converted. 11/19) switch simple users of fdget() to CLASS(fd, ...) low-hanging fruit; that's another likely source of conflicts over the cycle and it might make a lot of sense to split; fortunately, it can be split pretty much on per-function basis - chunks are independent from each other. 12/19) bpf: switch to CLASS(fd, ...) Calling conventions for __bpf_map_get() would be more convenient if it left fpdut() on failure to callers. Makes for simpler logics in the callers. Among other things, the proof of memory safety no longer has to rely upon file->private_data never being ERR_PTR(...) for bpffs files. Original calling conventions made it impossible for the caller to tell whether __bpf_map_get() has returned ERR_PTR(-EINVAL) because it has found the file not be a bpf map one (in which case it would've done fdput()) or because it found that ERR_PTR(-EINVAL) in file->private_data of a bpf map file (in which case fdput() would _not_ have been done). With that calling conventions change it's easy to switch all bpffs users to CLASS(fd, ...). 13/19) convert vmsplice() to CLASS(fd, ...) Irregularity here is fdput() not in the same scope as fdget(); we could just lift it out vmsplice_type() in vmsplice(2), but there's no much point keeping vmsplice_type() separate after that... 14/19) finit_module(): convert to CLASS(fd, ...) Slightly unidiomatic emptiness check; just lift it out of idempotent_init_module() and into finit_module(2) and that's it. 15/19) timerfd: switch to CLASS(fd, ...) Fold timerfd_fget() into both callers to have fdget() and fdput() in the same scope. Could be done in different ways, but this is probably the smallest solution. 16/19) do_mq_notify(): switch to CLASS(fd, ...) a minor twist is the reuse of struct fd instance in there 17/19) simplify xfs_find_handle() a bit XFS_IOC_FD_TO_HANDLE can grab a reference to copied ->f_path and let the file go; results in simpler control flow - cleanup is the same for both "by descriptor" and "by pathname" cases. NOTE: grabbing f->f_path to pin file_inode(f) is valid, since we are dealing with XFS files here - no reassignments of file_inode(). 18/19) convert kernel/events/core.c a questionable trick in perf_event_open(2) - deliberate call of fdget(-1), expecting it to yield empty 19/19) deal with the last remaing boolean uses of fd_file() ... replacing them with uses of fd_empty()