On Wed, Feb 12, 2020 at 3:51 PM Christian Brauner <christian.brauner@xxxxxxxxxx> wrote: > On Tue, Feb 11, 2020 at 09:55:46PM +0100, Jann Horn via Containers wrote: > > On Tue, Feb 11, 2020 at 5:59 PM Christian Brauner > > <christian.brauner@xxxxxxxxxx> wrote: > > > This is the implementation of shiftfs which was cooked up during lunch at > > > Linux Plumbers 2019 the day after the container's microconference. The > > > idea is a design-stew from Stéphane, Aleksa, Eric, and myself. Back then > > > we all were quite busy with other work and couldn't really sit down and > > > implement it. But I took a few days last week to do this work, including > > > demos and performance testing. > > > This implementation does not require us to touch the vfs substantially > > > at all. Instead, we implement shiftfs via fsid mappings. > > > With this patch, it took me 20 mins to port both LXD and LXC to support > > > shiftfs via fsid mappings. > > > > > > For anyone wanting to play with this the branch can be pulled from: > > > https://github.com/brauner/linux/tree/fsid_mappings > > > https://gitlab.com/brauner/linux/-/tree/fsid_mappings > > > https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=fsid_mappings > > > > > > The main use case for shiftfs for us is in allowing shared writable > > > storage to multiple containers using non-overlapping id mappings. > > > In such a scenario you want the fsids to be valid and identical in both > > > containers for the shared mount. A demo for this exists in [3]. > > > If you don't want to read on, go straight to the other demos below in > > > [1] and [2]. > > > > I guess essentially this means that you want to have UID separation > > between containers to prevent the containers - or their owners - from > > interfering between each other, but for filesystem access, you don't > > want to isolate them from each other using DAC controls on the files > > and folders inside the containers' directory hierarchies, instead > > relying on mode-0700 parent directories to restrict access to the > > container owner? Or would you still have separate UIDs for e.g. the > > container's UID range 0-65535, and then map the shared UID range at > > 100000, or something like that? > > Yes. > So if you look at the permissions right now for the directory under > which the rootfs for the container and other stuff resides we have > root@wittgenstein|/var/lib/lxd/storage-pools/zfs/containers > > perms * > d--x------ 100 alp1 > d--x------ 100 f1 > d--x------ 100 f2 > > We don't really share the rootfs between containers right now since we > treat them as standalone systems but with fsid mappings that's possible > too. Layer-sharing-centric runtimes very much will want something like > that. [...] > > > With this patch series we simply introduce the ability to create fsid > > > mappings that are different from the id mappings of a user namespace. > > > > > > In the usual case of running an unprivileged container we will have > > > setup an id mapping, e.g. 0 100000 100000. The on-disk mapping will > > > correspond to this id mapping, i.e. all files which we want to appear as > > > 0:0 inside the user namespace will be chowned to 100000:100000 on the > > > host. This works, because whenever the kernel needs to do a filesystem > > > access it will lookup the corresponding uid and gid in the idmapping > > > tables of the container. > > > Now think about the case where we want to have an id mapping of 0 100000 > > > 100000 but an on-disk mapping of 0 300000 100000 which is needed to e.g. > > > share a single on-disk mapping with multiple containers that all have > > > different id mappings. > > > This will be problematic. Whenever a filesystem access is requested, the > > > kernel will now try to lookup a mapping for 300000 in the id mapping > > > tables of the user namespace but since there is none the files will > > > appear to be owned by the overflow id, i.e. usually 65534:65534 or > > > nobody:nogroup. > > > > > > With fsid mappings we can solve this by writing an id mapping of 0 > > > 100000 100000 and an fsid mapping of 0 300000 100000. On filesystem > > > access the kernel will now lookup the mapping for 300000 in the fsid > > > mapping tables of the user namespace. And since such a mapping exists, > > > the corresponding files will have correct ownership. > > > > Sorry to bring up something as disgusting as setuid execution, but: > > No that's exactly what this needs. :) > > > What happens when there's a setuid root file with ->i_uid==300000? I > > guess the only way to make that work inside the containers would be > > something like make_kuid(current_user_ns(), > > from_kfsuid(current_user_ns(), inode->i_uid)) in the setuid execve > > path? > > What's the specific callpath you're thinking about? > > So if you look at patch > https://lore.kernel.org/lkml/20200211165753.356508-16-christian.brauner@xxxxxxxxxx/ > it does > - new->suid = new->fsuid = new->euid; > - new->sgid = new->fsgid = new->egid; > + fsuid = from_kuid_munged(new->user_ns, new->euid); > + kfsuid = make_kfsuid(new->user_ns, fsuid); > + new->suid = new->euid; > + new->fsuid = kfsuid; > + > + fsgid = from_kgid_munged(new->user_ns, new->egid); > + kfsgid = make_kfsgid(new->user_ns, fsgid); > + new->sgid = new->egid; > + new->fsgid = kfsgid; Aaah, okay, I missed that. > One thing I definitely missed though in the setuid path is to adapt > fs/exec.c:bprm_fill_uid(): > > diff --git a/fs/exec.c b/fs/exec.c > index 74d88dab98dd..ad839934fdff 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1547,8 +1547,8 @@ static void bprm_fill_uid(struct linux_binprm *bprm) > inode_unlock(inode); > > /* We ignore suid/sgid if there are no mappings for them in the ns */ > - if (!kuid_has_mapping(bprm->cred->user_ns, uid) || > - !kgid_has_mapping(bprm->cred->user_ns, gid)) > + if (!kfsuid_has_mapping(bprm->cred->user_ns, uid) || > + !kfsgid_has_mapping(bprm->cred->user_ns, gid)) > return; > > if (mode & S_ISUID) { [...] > > I want to open /proc/$pid/personality of another process with the same > > UIDs, may_open() will call inode_permission() -> do_inode_permission() > > -> generic_permission() -> acl_permission_check(), which will compare > > current_fsuid() (which is 300000) against inode->i_uid. But > > inode->i_uid was filled by proc_pid_make_inode()->task_dump_owner(), > > which set inode->i_uid to 100000, right? > > Yes. That should be fixable by something like below, I think. (And we can > probably shortcut this by adding a helper that does tell us whether there's > been any fsid mapping setup or not for this user namespace.) > static int acl_permission_check(struct inode *inode, int mask) > { > + kuid_t kuid; > unsigned int mode = inode->i_mode; > > - if (likely(uid_eq(current_fsuid(), inode->i_uid))) > + if (!is_userns_visible(inode->i_sb->s_iflags)) { > + kuid = inode->i_uid; > + } else { > + kuid = make_kuid(current_user_ns(), > + from_kfsuid(current_user_ns(), inode->i_uid)); > + } > + > + if (likely(uid_eq(current_fsuid(), kuid))) > mode >>= 6; > else {&& (mode & S_IRWXG)) { > > > > > Also, e.g. __ptrace_may_access() uses cred->fsuid for a comparison > > with another task's real/effective/saved UID. > > Right, you even introduced this check in 2015 iirc. > Both of your points make me think that it'd be easiest to introduce > cred->{kfsuid,kfsgid} and whenever an access decision on a > is_userns_visible() filesystem has to be made those will be used. This avoids > having to do on-the fly translations I guess that might be less ugly. > and ptrace_may_access() can just grow a > flag indicating what fscreds it's supposed to look at? Wouldn't you always end up using the "real" fsuid there?