Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Sun, 05 Feb 2017 17:18:11 -0800

On Sun, 2017-02-05 at 09:51 +0200, Amir Goldstein wrote:
> On Sat, Feb 4, 2017 at 9:19 PM, James Bottomley
> <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > This allows any subtree to be uid/gid shifted and bound elsewhere.
> > It does this by operating similarly to overlayfs.  Its primary use
> > is for shifting the underlying uids of filesystems used to support
> > unprivileged (uid shifted) containers.  The usual use case here is
> > that the container is operating with a uid shifted unprivileged
> > root but sometimes needs to make use of or work with a filesystem
> > image that has root at real uid 0.
> > 
> > The mechanism is to allow any subordinate mount namespace to mount 
> > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
> > allowing it to mount marked subtrees (using the -o mark option as 
> > root).   Once mounted, the subtree is mapped via the super block 
> > user namespace so that the interior ids of the mounting user 
> > namespace are the ids written to the filesystem.
> > 
> > Signed-off-by: James Bottomley <
> > James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
> > 
> 
> James,
> 
> Allow me to point out some problems in this patch and offer a 
> slightly different approach.
> 
> First of all, the subject says "uid/gid shifting bind mount", but 
> it's not really a bind mount. What it is is a stackable mount and 2 
> levels of stack no less.

The reason for the description is to have it behave exactly like a bind
mount.  You can assert that a bind mount is, in fact, a stacked mount,
but we don't currently.  I'm also not sure where you get your 2 levels
from?

>  So one thing that is missing is increasing of sb->s_stack_depth and
> that also means that shiftfs cannot be used to recursively shift uids
> in child userns if that was ever the intention.

I can't think of a use case that would ever need that, but perhaps
other container people can.

> The other problem is that by forking overlayfs functionality,

So this wouldn't really be the right way to look at it: shiftfs shares
no code with overlayfs at all, so is definitely not a fork.  The only
piece of functionality it has which is similar to overlayfs is the way
it does lookups via a new dentry cache.  However, that functionality is
not unique to overlayfs and if you look, you'll see that
shiftfs_lookup() actually has far more in common with
ecryptfs_lookup().

>  shiftfs is going to miss out on overlayfs bug fixes related to user
> credentials differing from mounter credentials, like fd3220d ("ovl:
> update S_ISGID when setting posix ACLs"). I am not sure that this
> specific case is relevant to shiftfs, but there could be others.

OK, so shiftfs doesn't have this bug and the reason why is
illustrative: basically shiftfs does three things:

   1. lookups via a uid/gid shifted dentry cache
   2. inode operation permission checks on the underlying filesystem,
      run with shifted credentials
   3. location marking for unprivileged mount

I think we've already seen that 1. isn't from overlayfs but the
functionality could be added to overlayfs, I suppose.  The big problem
is 2.  The overlayfs code emulates the permission checks, which makes
it rather complex (this is where you get your bugs like the above
from).  I did actually look at adding 2. to overlayfs on the theory
that a single layer overlay might be closest to what this is, but
eventually concluded I'd have to take the special cases and add a whole
lot more to them ... it really would increase the maintenance burden
substantially and make the code an unreadable rat's nest.

When you think about it this way, it becomes obvious that the clean
separation is to layer the shiftfs functionality on top of overlayfs,
and once you do that, doing it as its own filesystem is more logical.
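
As an illustration of the id shift behind items 1. and 2. above, here
is a minimal userspace sketch.  It is not shiftfs code: the names
id_extent, outside_to_inside() and ID_INVALID are hypothetical, and the
table layout is only assumed to mirror the "inside outside count"
format of /proc/<pid>/uid_map.  It shows how an id as seen on the host
could be turned into the interior id that would be written to the
underlying filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one uid_map-style line: inside, outside, count. */
struct id_extent {
	uint32_t inside;   /* first id as seen inside the namespace */
	uint32_t outside;  /* first id as seen on the host */
	uint32_t count;    /* number of consecutive ids covered */
};

#define ID_INVALID UINT32_MAX  /* assumption: marks "no mapping" */

/*
 * Translate a host-side id into the interior id of the namespace.
 * In this model the interior id is what ends up on the filesystem.
 */
static uint32_t outside_to_inside(const struct id_extent *map, size_t n,
				  uint32_t outside)
{
	assert(map != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* subtract first so the range test cannot overflow */
		if (outside >= map[i].outside &&
		    outside - map[i].outside < map[i].count)
			return map[i].inside + (outside - map[i].outside);
	}
	return ID_INVALID;  /* id is not mapped in this namespace */
}
```

With a single extent { 0, 100000, 65536 }, the container's root (host
id 100000) maps to interior id 0, while a host id outside the range has
no mapping at all.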

> So how about, instead of forking a new containers specialized 
> stackable fs, that the needed functionality be merged into overlayfs 
> code? I think overlayfs container users may also benefit from shiftfs
> functionality, no?

I think I covered why not to merge the code above.  As to the
functionality, since Docker already has a graph driver, the graph
driver can do the shifting on top of the overlays.

>  In any case, overlayfs has considerable mileage as an fs for
> containers, so many issues related to running with different userns
> may have already been addressed.

Overlayfs is s_user_ns blind so it's highly unlikely to have seen any
issues with the user namespaces, let alone addressed them.  This will
also be compounded by the fact that its primary user, Docker, currently
makes rather weak use of the user namespace.

The other thing is the use case: Most immutable infrastructure
container systems create the overlays in the host and then bind them
into the container.  This binding is an additional mount operation.
Now they could mount from an overlay as an overlay, but that adds
complexity because the container itself cannot control the overlay
(it's a host provided thing) so it is definitely cleaner to make the
second mount a different filesystem (i.e. shiftfs) where the nature of
the overlay is hidden from the container.

> Overlayfs already stores the mounter's credentials and uses them to 
> perform most of the operations on upper.

OK, that's case 2. again.  So I think you may be labouring under the
misapprehension that shiftfs and overlayfs do the same thing with
override credentials?  They don't: overlayfs emulates the permission
checks and then overrides based on *historical* admin credentials to
force what it has already decided onto the underlying filesystem.
Shiftfs overrides the *current* credentials with a uid/gid and
namespace shift and then runs the permission checks.  Thus if I wanted
to add what shiftfs does to overlayfs, I'd have to add another load of
overriding based on current credentials to the currently unoverridden
emulated permission checks.  I think you can see that simply running
the real permission checks on the underlying filesystem with overridden
credentials is much simpler.
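
The "shift the current credentials, then run the real check" idea can
be sketched in userspace as follows.  This is a simplified model under
stated assumptions, not the shiftfs implementation: cred_model,
inode_model, shift_range, lower_permission() and shifted_permission()
are all hypothetical names, the lower check is a plain owner/group/
other mode test, and capabilities and ACLs are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for credentials and inode ownership. */
struct cred_model  { uint32_t uid, gid; };
struct inode_model { uint32_t uid, gid; unsigned mode; };

enum { MAY_X = 1, MAY_W = 2, MAY_R = 4 };

/* Stand-in for the real check the underlying filesystem would run. */
static bool lower_permission(const struct cred_model *c,
			     const struct inode_model *i, unsigned mask)
{
	unsigned bits;

	if (c->uid == i->uid)
		bits = (i->mode >> 6) & 7;   /* owner bits */
	else if (c->gid == i->gid)
		bits = (i->mode >> 3) & 7;   /* group bits */
	else
		bits = i->mode & 7;          /* other bits */
	return (bits & mask) == mask;
}

/* Assumption: host ids [base, base + count) appear as [0, count). */
struct shift_range { uint32_t base, count; };

static bool shift_id(const struct shift_range *r, uint32_t id, uint32_t *out)
{
	if (id < r->base || id - r->base >= r->count)
		return false;                /* id has no mapping */
	*out = id - r->base;
	return true;
}

/* Shift the caller's credentials, then defer to the lower check. */
static bool shifted_permission(const struct cred_model *caller,
			       const struct shift_range *r,
			       const struct inode_model *i, unsigned mask)
{
	struct cred_model shifted;

	assert(caller && r && i);
	if (!shift_id(r, caller->uid, &shifted.uid) ||
	    !shift_id(r, caller->gid, &shifted.gid))
		return false;  /* unmapped ids get no access in this model */
	return lower_permission(&shifted, i, mask);
}
```

The point of the model is that no permission logic is duplicated: only
the credentials change, and the lower filesystem's own check decides.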

> I know it wasn't the original purpose of overlayfs to run as a single 
> layer, but there is nothing really preventing it from doing that. In 
> fact, I am doing just that with my snapshot mount patches, see:
> https://github.com/amir73il/linux/commit/acc6c25eab03c176c9ef736544fab3fba663765d#diff-2b85a3c5bea4263d08a2bdff639192c3
> I registered a new fs type ("snapshot"), which reuses most of the 
> existing overlayfs operations. With this patch it is possible to 
> mount an overlay with only upper layer, so all the operations are
> pass through except for the credentials, e.g.:
> 
> mount -t snapshot -o upper=<origin> shiftfs_test <mark location>

OK, so since you don't need to special case the permission checks, I
can see why this might work for you because you don't need to modify
overlayfs to do this.  Since I can't consume the overlay code as is, it
doesn't work for me because I'd have to add lots of special case code
to it.

James

> If you think this concept is workable, then the functionality of
> mounting overlayfs with only upper should be integrated into plain 
> overlayfs and shiftfs could be a very thin variant of overlayfs mount 
> using shiftfs_fs_type, just for the sake of having FS_USERNS_MOUNT,
> e.g:
> 
> +/*
> + * XXX: reusing ovl_mount()/ovl_fill_super(), but could also just reuse
> + * ovl_dentry_operations/ovl_super_operations/ovl_xattr_handlers/
> + * ovl_new_inode()
> + */
> +static struct file_system_type shiftfs_type = {
> +       .owner          = THIS_MODULE,
> +       .name           = "shiftfs",
> +       .mount          = ovl_mount,
> +       .kill_sb        = kill_anon_super,
> +       .fs_flags       = FS_USERNS_MOUNT,
> +};
> +MODULE_ALIAS_FS("shiftfs");
> +MODULE_ALIAS("shiftfs");
> +#define IS_SHIFTFS_SB(sb) ((sb)->s_type == &shiftfs_type)
> 
> And instead of verifying that shiftfs is mounted inside container
> over shiftfs, verify that it is mounted over an overlayfs noexec
> mount e.g.:
> 
> +       if (IS_SHIFTFS_SB(sb)) {
> +               /*
> +                * this leg executes if we're admin capable in
> +                * the namespace, so be very careful
> +                */
> +               if (path.dentry->d_sb->s_magic != OVERLAYFS_MAGIC ||
> +                   !(path.dentry->d_sb->s_iflags & SB_I_NOEXEC))
> +                       goto out_put;
> 
> From a user's manual POV:
> 
> in host:
> mount -t overlay -o noexec,upper=<origin> container_visible <mark location>
> 
> in container:
> mount -t shiftfs -o upper=<mark location> container_writable <somewhere in my local mount ns>
> 
> Thoughts?
>