On Mon, 2017-02-06 at 08:59 +0200, Amir Goldstein wrote:
> On Mon, Feb 6, 2017 at 3:18 AM, James Bottomley
> <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > On Sun, 2017-02-05 at 09:51 +0200, Amir Goldstein wrote:
> > > On Sat, Feb 4, 2017 at 9:19 PM, James Bottomley
> > > <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > > > This allows any subtree to be uid/gid shifted and bound
> > > > elsewhere.  It does this by operating similarly to overlayfs.
> > > > Its primary use is for shifting the underlying uids of
> > > > filesystems used to support unprivileged (uid shifted)
> > > > containers.  The usual use case here is that the container is
> > > > operating with a uid shifted unprivileged root but sometimes
> > > > needs to make use of or work with a filesystem image that has
> > > > root at real uid 0.
> > > >
> > > > The mechanism is to allow any subordinate mount namespace to
> > > > mount a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but
> > > > only allowing it to mount marked subtrees (using the -o mark
> > > > option as root).  Once mounted, the subtree is mapped via the
> > > > super block user namespace so that the interior ids of the
> > > > mounting user namespace are the ids written to the filesystem.
> > > >
> > > > Signed-off-by: James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
> > > >
> > > James,
> > >
> > > Allow me to point out some problems in this patch and offer a
> > > slightly different approach.
> > >
> > > First of all, the subject says "uid/gid shifting bind mount", but
> > > it's not really a bind mount.  What it is is a stackable mount,
> > > and 2 levels of stack no less.
> >
> > The reason for the description is to have it behave exactly like a
> > bind mount.  You can assert that a bind mount is, in fact, a
> > stacked mount, but we don't currently.  I'm also not sure where
> > you get your 2 levels from?
> >
>
> A bind mount does not incur recursion into VFS code; a stacked fs
> does.  And there is a programmable limit on stack depth of 2, which
> stacked filesystems need to comply with.  Your proposed setup has 2
> stacked fs: the mark shiftfs by the admin and the uid shiftfs by the
> container user.  Or maybe I misunderstood.

Oh, right, actually, it wouldn't be 2 because once the unprivileged
mount uses the marked filesystem, what it uses is the mnt and dentry
from the underlying filesystem (what you would have got from a path
lookup on it).  That said, it does perform recursive calls to the
underlying filesystem unlike a true bind mount, so I can add the
depth accounting easily enough.

> > > So one thing that is missing is increasing of sb->s_stack_depth,
> > > and that also means that shiftfs cannot be used to recursively
> > > shift uids in a child userns, if that was ever the intention.
> >
> > I can't think of a use case that would ever need that, but perhaps
> > other container people can.
> >
> > > The other problem is that by forking overlayfs functionality,
> >
> > So this wouldn't really be the right way to look at it: shiftfs
> > shares no code with overlayfs at all, so is definitely not a fork.
> > The only piece of functionality it has which is similar to
> > overlayfs is the way it does lookups via a new dentry cache.
> > However, that functionality is not unique to overlayfs and if you
> > look, you'll see that shiftfs_lookup() actually has far more in
> > common with ecryptfs_lookup().
>
> That's a good point.  All stackable file systems may share similar
> problems and solutions (e.g. consistent st_ino/st_dev).
> Perhaps it calls for shared library code or more generic VFS code.
> At the moment ecryptfs is not seeing much development, so everything
> happens in overlayfs.  If there is going to be more than one
> actively developed stackable fs, we need to see about that.

I believe we already do ... if you look at the lookup functions of
each of them, you see the only common thing is encapsulated in a
variant of the lookup_one_len() functions.  After that, even simple
things like our negative dentry handling differ.

> > > shiftfs is going to miss out on overlayfs bug fixes related to
> > > user credentials differing from mounter credentials, like
> > > fd3220d ("ovl: update S_ISGID when setting posix ACLs").  I am
> > > not sure that this specific case is relevant to shiftfs, but
> > > there could be others.
> >
> > OK, so shiftfs doesn't have this bug and the reason why is
> > illustrative: basically shiftfs does three things
> >
> > 1. lookups via a uid/gid shifted dentry cache
> > 2. shifted credential inode operations permission checks on the
> >    underlying filesystem
> > 3. location marking for unprivileged mount
> >
> > I think we've already seen that 1. isn't from overlayfs but the
> > functionality could be added to overlayfs, I suppose.  The big
> > problem is 2.  The overlayfs code emulates the permission checks,
> > which makes it rather complex (this is where you get your bugs
> > like the above from).  I did actually look at adding 2. to
> > overlayfs on the theory that a single layer overlay might be
> > closest to what this is, but eventually concluded I'd have to take
> > the special cases and add a whole lot more to them ... it really
> > would increase the maintenance burden substantially and make the
> > code an unreadable rat's nest.
> >
>
> The use cases for uid shifting are still overwhelming for me.
> I take your word for it that it's going to be a maintenance burden
> to add this functionality to overlayfs.
>
> > When you think about it this way, it becomes obvious that the
> > clean separation is if shiftfs functionality is layered on top of
> > overlayfs, and when you do that, doing it as its own filesystem is
> > more logical.
> >
>
> Yes, I agree with that statement.  This is in line with the solution
> I outlined at the end of my previous email, where a single layer
> overlayfs is used for the host "mark" mount, although I wonder if
> the same cannot be achieved with a bind mount?

I understand, but once I can't consume overlayfs to construct it, the
idea of trying to use it becomes a negative, not a positive.

We could achieve the same thing using bind mounts if the vfsmount
structure carried a private field, but it doesn't.  I think, given
the prevalence of this structure throughout the mount tree, that's a
deliberate decision to keep it thin.

> in host:
> mount -t overlay -o noexec,upper=<origin> container_visible <mark location>
>
> in container:
> mount -t shiftfs -o <mark location> <somewhere in my local mount ns>

So I'm not sure it's a more widespread problem: mount --bind is
usable inside an unprivileged container, which means you can bridge
filesystem subtrees even when you're only the local container admin.
The problem is mounting other filesystem types.  Marking a type safe
for mounting is done by the FS_USERNS_MOUNT flag, but it means for
things like shiftfs that you do have to restrict the source location;
for most filesystem types, though, that source will be a device, so
they will need other checking than a mount mark.

James
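
For reference, here is a minimal sketch of the two mount-time pieces the
thread keeps coming back to: the FS_USERNS_MOUNT flag that lets a
filesystem type be mounted from inside a user namespace, and the
sb->s_stack_depth accounting against FILESYSTEM_MAX_STACK_DEPTH that
James says he can add easily enough.  This is not the code from the
patch under discussion: the names (shiftfs_mount, shiftfs_fill_super)
follow the patch's naming but the bodies are simplified assumptions,
and the "-o mark" restriction, option parsing, and the rest of the
superblock setup (s_op, s_root, ...) are deliberately elided.

/*
 * Illustrative sketch only -- not the patch code.  Shows FS_USERNS_MOUNT
 * and sb->s_stack_depth handling for a stacked filesystem type.
 */
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/namei.h>

static int shiftfs_fill_super(struct super_block *sb, void *data, int silent)
{
	struct path lower;
	int err;

	/* the source path was stashed in *data by ->mount below */
	err = kern_path((char *)data, LOOKUP_FOLLOW, &lower);
	if (err)
		return err;

	/*
	 * Recursive calls go to the underlying filesystem, so account for
	 * one extra level of nesting and refuse to exceed the VFS limit --
	 * the sb->s_stack_depth point Amir raised.
	 */
	sb->s_stack_depth = lower.dentry->d_sb->s_stack_depth + 1;
	if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
		path_put(&lower);
		return -EINVAL;
	}

	/*
	 * A real unprivileged (userns) mount would additionally have to be
	 * restricted to subtrees previously marked by a real-root
	 * "-o mark" mount; that check and the remaining superblock setup
	 * are omitted from this sketch.
	 */
	path_put(&lower);
	return 0;
}

static struct dentry *shiftfs_mount(struct file_system_type *fs_type,
				    int flags, const char *devname, void *data)
{
	return mount_nodev(fs_type, flags, (void *)devname, shiftfs_fill_super);
}

static struct file_system_type shiftfs_type = {
	.owner		= THIS_MODULE,
	.name		= "shiftfs",
	.mount		= shiftfs_mount,
	.kill_sb	= kill_anon_super,
	/* FS_USERNS_MOUNT is what allows a user namespace to mount this type */
	.fs_flags	= FS_USERNS_MOUNT,
};
MODULE_ALIAS_FS("shiftfs");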
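
In the same spirit, a sketch of point 2 in James's list -- doing the
permission check on the underlying inode under shifted credentials
rather than re-implementing the VFS checks the way overlayfs does.
Again this is illustrative rather than the patch code: stashing the
lower inode in i_private and the exact credential handling here are
assumptions for the example.

/*
 * Illustrative sketch only -- a ->permission op that translates the
 * caller's ids through the mount's user namespace and then lets the
 * VFS do the real check on the lower inode.
 */
#include <linux/cred.h>
#include <linux/fs.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

static int shiftfs_permission(struct inode *inode, int mask)
{
	struct inode *lower = inode->i_private;	/* assumed: lower inode stashed here */
	struct user_namespace *ns = inode->i_sb->s_user_ns;
	const struct cred *oldcred;
	struct cred *newcred;
	int err;

	/* credential allocation may sleep, so bail out of RCU-walk */
	if (mask & MAY_NOT_BLOCK)
		return -ECHILD;

	newcred = prepare_creds();
	if (!newcred)
		return -ENOMEM;

	/*
	 * Shift fsuid/fsgid: the interior ids of the mounting user
	 * namespace become the ids used against the lower filesystem.
	 * Real code must also handle ids that have no mapping.
	 */
	newcred->fsuid = KUIDT_INIT(from_kuid(ns, current_fsuid()));
	newcred->fsgid = KGIDT_INIT(from_kgid(ns, current_fsgid()));

	oldcred = override_creds(newcred);
	err = inode_permission(lower, mask);
	revert_creds(oldcred);
	put_cred(newcred);

	return err;
}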