On Mon, Dec 16, 2019 at 9:05 PM Laurent Vivier <laurent@xxxxxxxxx> wrote: > Le 16/12/2019 à 20:08, Jann Horn a écrit : > > On Mon, Dec 16, 2019 at 10:12 AM Laurent Vivier <laurent@xxxxxxxxx> wrote: > >> This patch allows to have a different binfmt_misc configuration > >> for each new user namespace. By default, the binfmt_misc configuration > >> is the one of the previous level, but if the binfmt_misc filesystem is > >> mounted in the new namespace a new empty binfmt instance is created and > >> used in this namespace. > >> > >> For instance, using "unshare" we can start a chroot of another > >> architecture and configure the binfmt_misc interpreter without being root > >> to run the binaries in this chroot. > > > > How do you ensure that when userspace is no longer using the user > > namespace and mount namespace, the entries and the binfmt_misc > > superblock are deleted? As far as I can tell from looking at the code, > > at the moment, if I create a user namespace+mount namespace, mount > > binfmt_misc in there, register a file format and then let all > > processes inside the namespaces exit, the binfmt_misc mount will be > > kept alive by the simple_pin_fs() stuff, and the binfmt_misc entries > > will also stay in memory. > > > > [...] > > Do you have any idea how I can fix this issue? I think the easiest way (keeping in mind that we want to avoid having to fiddle around with reference loops, where e.g. interpreter executable files opened by binfmt_misc have references back to the user namespace through ->f_cred) would be to add a new patch in front of this one that changes the semantics such that when binfmt_misc is unmounted, all the existing format registrations are deleted. That's probably also nicer from the perspective of inspectability. It could in theory break stuff, but I think that's probably somewhat unlikely. Still, it'd be an API change, and therefore you should CC linux-api@ on such a change. > >> @@ -718,7 +736,9 @@ static ssize_t bm_register_write(struct file *file, const char __user *buffer, > >> if (!inode) > >> goto out2; > >> > >> - err = simple_pin_fs(&bm_fs_type, &bm_mnt, &entry_count); > >> + ns = binfmt_ns(file_dentry(file)->d_sb->s_user_ns); > >> + err = simple_pin_fs(&bm_fs_type, &ns->bm_mnt, > >> + &ns->entry_count); > > > > When you call simple_pin_fs() here, the user namespace of `current` > > and the user namespace of the superblock are not necessarily related. > > So simple_pin_fs() may end up taking a reference on the mountpoint for > > a user namespace that has nothing to do with the namespace for which > > an entry is being created. > > Do you have any idea how I can fix this issue? If you fix the memory leak the way I suggested, this wouldn't be a problem anymore either. > >> +static void bm_free(struct fs_context *fc) > >> +{ > >> + if (fc->s_fs_info) > >> + put_user_ns(fc->s_fs_info); > >> +} > > > > Silly question: Why the "if"? Can you ever reach this with fc->s_fs_info==NULL? > > So I understand the if is unnecessary and I will remove it. Your code was actually exactly right, I didn't understand how fc->s_fs_info works. > >> static int bm_get_tree(struct fs_context *fc) > >> { > >> - return get_tree_single(fc, bm_fill_super); > >> + return get_tree_keyed(fc, bm_fill_super, get_user_ns(fc->user_ns)); > > > > get_user_ns() increments the refcount of the namespace, but in the > > case where a binfmt_misc mount already exists, that refcount is never > > dropped, right? That would be a security bug, since an attacker could > > overflow the refcount of the user namespace and then trigger a UAF. > > (And the refcount hardening won't catch it because user namespaces > > still use raw atomics instead of refcount_t.) > > Do you have any idea how I can fix this issue? Ah, this was actually fine. I missed that get_tree_keyed() stashes that pointer in fc->s_fs_info. > >> +#if IS_ENABLED(CONFIG_BINFMT_MISC) > > > > Nit: Isn't this kind of check normally written as "#ifdef"? > > > > What is the difference? As explained in Documentation/process/coding-style.rst and the relevant header, IS_ENABLED() is for inline use in C expressions.