On Thu, Oct 28, 2021 at 12:31:13PM +0200, Christian Brauner wrote: > From: Christian Brauner <christian.brauner@xxxxxxxxxx> > > Currently, registering a new binary type pins the binfmt_misc > filesystem. Specifically, this means that as long as there is at least > one binary type registered the binfmt_misc filesystem survives all > umounts, i.e. the superblock is not destroyed. Meaning that a umount > followed by another mount will end up with the same superblock and the > same binary type handlers. This is a behavior we tend to discourage for > any new filesystems (apart from a few special filesystems such as e.g. > configfs or debugfs). A umount operation without the filesystem being > pinned - by e.g. someone holding a file descriptor to an open file - > should usually result in the destruction of the superblock and all > associated resources. This makes introspection easier and leads to > clearly defined, simple and clean semantics. An administrator can rely > on the fact that a umount will guarantee a clean slate making it > possible to reinitialize a filesystem. Right now all binary types would > need to be explicitly deleted before that can happen. > > This allows us to remove the heavy-handed calls to simple_pin_fs() and > simple_release_fs() when creating and deleting binary types. This in > turn allows us to replace the current brittle pinning mechanism abusing > dget() which has caused a range of bugs judging from prior fixes in [2] > and [3]. The additional dget() in load_misc_binary() pins the dentry but > only does so for the sake to prevent ->evict_inode() from freeing the > node when a user removes the binary type and kill_node() is run. Which > would mean ->interpreter and ->interp_file would be freed causing a UAF. > > This isn't really nicely documented nor is it very clean because it > relies on simple_pin_fs() pinning the filesystem as long as at least one > binary type exists. Otherwise it would cause load_misc_binary() to hold > on to a dentry belonging to a superblock that has been shutdown. > Replace that implicit pinning with a clean and simple per-node refcount > and get rid of the ugly dget() pinning. A similar mechanism exists for > e.g. binderfs (cf. [4]). All the cleanup work can now be done in > ->evict_inode(). > > In a follow-up patch we will make it possible to use binfmt_misc in > sandboxes. We will use the cleaner semantics where a umount for the > filesystem will cause the superblock and all resources to be > deallocated. In preparation for this apply the same semantics to the > initial binfmt_misc mount. Note, that this is a user-visible change and > as such a uapi change but one that we can reasonably risk. We've > discussed this in earlier versions of this patchset (cf. [1]). > > The main user and provider of binfmt_misc is systemd. Systemd provides > binfmt_misc via autofs since it is configurable as a kernel module and > is used by a few exotic packages and users. As such a binfmt_misc mount > is triggered when /proc/sys/fs/binfmt_misc is accessed and is only > provided on demand. Other autofs on demand filesystems include EFI ESP > which systemd umounts if the mountpoint stays idle for a certain amount > of time. This doesn't apply to the binfmt_misc autofs mount which isn't > touched once it is mounted meaning this change can't accidently wipe > binary type handlers without someone having explicitly unmounted > binfmt_misc. After speaking to systemd folks they don't expect this > change to affect them. > > In line with our general policy, if we see a regression for systemd or > other users with this change we will switch back to the old behavior for > the initial binfmt_misc mount and have binary types pin the filesystem > again. But while we touch this code let's take the chance and let's > improve on the status quo. > > [1]: https://lore.kernel.org/r/20191216091220.465626-2-laurent@xxxxxxxxx > [2]: commit 43a4f2619038 ("exec: binfmt_misc: fix race between load_misc_binary() and kill_node()" > [3]: commit 83f918274e4b ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()") > [4]: commit f0fe2c0f050d ("binder: prevent UAF for binderfs devices II") > Cc: Sargun Dhillon <sargun@xxxxxxxxx> > Cc: Serge Hallyn <serge@xxxxxxxxxx> This *looks* right to me. I'll keep looking back at this one while I look at the second patch, but Acked-by: Serge Hallyn <serge@xxxxxxxxxx> > Cc: Jann Horn <jannh@xxxxxxxxxx> > Cc: Henning Schild <henning.schild@xxxxxxxxxxx> > Cc: Andrei Vagin <avagin@xxxxxxxxx> > Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > Cc: Laurent Vivier <laurent@xxxxxxxxx> > Cc: linux-fsdevel@xxxxxxxxxxxxxxx > Signed-off-by: Christian Brauner <christian.brauner@xxxxxxxxxx> > --- > fs/binfmt_misc.c | 56 +++++++++++++++++++++++++++++++----------------- > 1 file changed, 36 insertions(+), 20 deletions(-) > > diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c > index e1eae7ea823a..5a9d5e44c750 100644 > --- a/fs/binfmt_misc.c > +++ b/fs/binfmt_misc.c > @@ -60,12 +60,11 @@ typedef struct { > char *name; > struct dentry *dentry; > struct file *interp_file; > + refcount_t ref; > } Node; > > static DEFINE_RWLOCK(entries_lock); > static struct file_system_type bm_fs_type; > -static struct vfsmount *bm_mnt; > -static int entry_count; > > /* > * Max length of the register string. Determined by: > @@ -126,6 +125,16 @@ static Node *check_file(struct linux_binprm *bprm) > return NULL; > } > > +/* Free node if we are sure load_misc_binary() is done with it. */ > +static void put_node(Node *e) > +{ > + if (refcount_dec_and_test(&e->ref)) { > + if (e->flags & MISC_FMT_OPEN_FILE) > + filp_close(e->interp_file, NULL); > + kfree(e); > + } > +} > + > /* > * the loader itself > */ > @@ -142,8 +151,9 @@ static int load_misc_binary(struct linux_binprm *bprm) > /* to keep locking time low, we copy the interpreter string */ > read_lock(&entries_lock); > fmt = check_file(bprm); > + /* Make sure the node isn't freed behind our back. */ > if (fmt) > - dget(fmt->dentry); > + refcount_inc(&fmt->ref); > read_unlock(&entries_lock); > if (!fmt) > return retval; > @@ -198,7 +208,16 @@ static int load_misc_binary(struct linux_binprm *bprm) > > retval = 0; > ret: > - dput(fmt->dentry); > + > + /* > + * If we actually put the node here all concurrent calls to > + * load_misc_binary() will have finished. We also know > + * that for the refcount to be zero ->evict_inode() must have removed > + * the node to be deleted from the list. All that is left for us is to > + * close and free. > + */ > + put_node(fmt); > + > return retval; > } > > @@ -557,26 +576,29 @@ static void bm_evict_inode(struct inode *inode) > { > Node *e = inode->i_private; > > - if (e && e->flags & MISC_FMT_OPEN_FILE) > - filp_close(e->interp_file, NULL); > - > clear_inode(inode); > - kfree(e); > + > + if (e) { > + write_lock(&entries_lock); > + list_del_init(&e->list); > + write_unlock(&entries_lock); > + put_node(e); > + } > } > > static void kill_node(Node *e) > { > struct dentry *dentry; > > - write_lock(&entries_lock); > - list_del_init(&e->list); > - write_unlock(&entries_lock); > - > + /* > + * It's fine to unconditionally drop the dentry since ->evict_inode() > + * will check the refcount before freeing the node and so it can't go > + * away behind load_misc_binary()'s back. > + */ > dentry = e->dentry; > drop_nlink(d_inode(dentry)); > d_drop(dentry); > dput(dentry); > - simple_release_fs(&bm_mnt, &entry_count); > } > > /* /<entry> */ > @@ -683,13 +705,7 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, > if (!inode) > goto out2; > > - err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count); > - if (err) { > - iput(inode); > - inode = NULL; > - goto out2; > - } > - > + refcount_set(&e->ref, 1); > e->dentry = dget(dentry); > inode->i_private = e; > inode->i_fop = &bm_entry_operations; > > base-commit: 3906fe9bb7f1a2c8667ae54e967dc8690824f4ea > -- > 2.30.2