On Sun, 2017-02-05 at 09:51 +0200, Amir Goldstein wrote:
> On Sat, Feb 4, 2017 at 9:19 PM, James Bottomley
> <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > This allows any subtree to be uid/gid shifted and bound elsewhere.
> > It does this by operating similarly to overlayfs.  Its primary use
> > is for shifting the underlying uids of filesystems used to support
> > unprivileged (uid shifted) containers.  The usual use case here is
> > that the container is operating with an uid shifted unprivileged
> > root but sometimes needs to make use of or work with a filesystem
> > image that has root at real uid 0.
> >
> > The mechanism is to allow any subordinate mount namespace to mount
> > a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but only
> > allowing it to mount marked subtrees (using the -o mark option as
> > root).  Once mounted, the subtree is mapped via the super block
> > user namespace so that the interior ids of the mounting user
> > namespace are the ids written to the filesystem.
> >
> > Signed-off-by: James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
> >
>
> James,
>
> Allow me to point out some problems in this patch and offer a
> slightly different approach.
>
> First of all, the subject says "uid/gid shifting bind mount", but
> it's not really a bind mount. What it is is a stackable mount and 2
> levels of stack no less.

The reason for the description is to have it behave exactly like a bind
mount.  You can assert that a bind mount is, in fact, a stacked mount,
but we don't currently.  I'm also not sure where you get your 2 levels
from?

> So one thing that is missing is increasing of sb->s_stack_depth and
> that also means that shiftfs cannot be used to recursively shift uids
> in child userns if that was ever the intention.

I can't think of a use case that would ever need that, but perhaps
other container people can.

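For readers not familiar with the stacking limit mentioned above, the
depth accounting that stacking filesystems do at mount time can be
sketched in userspace as below.  This is a toy model only, not kernel
code and not part of the patch: struct toy_sb and toy_stack_on() are
made-up names, and the limit of 2 is an assumption taken from
FILESYSTEM_MAX_STACK_DEPTH in include/linux/fs.h.

```c
/* Assumed to mirror FILESYSTEM_MAX_STACK_DEPTH; treat as illustrative. */
#define TOY_MAX_STACK_DEPTH 2

/* Hypothetical stand-in for the one super_block field of interest. */
struct toy_sb {
	const char *name;
	int s_stack_depth;	/* 0 for a filesystem that stacks on nothing */
};

/*
 * Account for mounting 'upper' on top of 'lower'.
 * Returns 0 and records the new depth on success, or -1 (leaving
 * 'upper' untouched) if the stack would become too deep.
 */
static int toy_stack_on(struct toy_sb *upper, const struct toy_sb *lower)
{
	int depth = lower->s_stack_depth + 1;

	if (depth > TOY_MAX_STACK_DEPTH)
		return -1;
	upper->s_stack_depth = depth;
	return 0;
}
```

With this accounting, ext4 -> overlay -> shiftfs would mount, and one
more stacked layer on top would be refused.  A filesystem that never
bumps the depth is invisible to the check, which is the omission being
pointed out.
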
> The other problem is that by forking overlayfs functionality,

So this wouldn't really be the right way to look at it: shiftfs shares
no code with overlayfs at all, so is definitely not a fork.  The only
piece of functionality it has which is similar to overlayfs is the way
it does lookups via a new dentry cache.  However, that functionality is
not unique to overlayfs and if you look, you'll see that
shiftfs_lookup() actually has far more in common with
ecryptfs_lookup().

> shiftfs is going to miss out on overlayfs bug fixes related to user
> credentials differ from mounter credentials, like fd3220d ("ovl:
> update S_ISGID when setting posix ACLs"). I am not sure that this
> specific case is relevant to shiftfs, but there could be other.

OK, so shiftfs doesn't have this bug and the reason why is
illustrative: basically shiftfs does three things

   1. lookups via a uid/gid shifted dentry cache
   2. shifted credential inode operations permission checks on the
      underlying filesystem
   3. location marking for unprivileged mount

I think we've already seen that 1. isn't from overlayfs but the
functionality could be added to overlayfs, I suppose.  The big problem
is 2.  The overlayfs code emulates the permission checks, which makes
it rather complex (this is where you get your bugs like the above
from).  I did actually look at adding 2. to overlayfs on the theory
that a single layer overlay might be closest to what this is, but
eventually concluded I'd have to take the special cases and add a whole
lot more to them ... it really would increase the maintenance burden
substantially and make the code an unreadable rat's nest.

When you think about it this way, it becomes obvious that the clean
separation is if shiftfs functionality is layered on top of overlayfs
and when you do that, doing it as its own filesystem is more logical.

> So how about, instead of forking a new containers specialized
> stackable fs, that the needed functionality be merged into overlayfs
> code?
> I think overlayfs container users may also benefit from shiftfs
> functionality, no?

I think I covered why not to merge the code above.  As to the
functionality, since Docker already has a graph driver, the graph
driver can do the shifting on top of the overlays.

> In any case, overlayfs has considerable mileage used as fs for
> containers, so many issues related to running with different userns
> may have already been addressed.

Overlayfs is s_user_ns blind so it's highly unlikely to have seen any
issues with the user namespaces, let alone addressed them.  This will
also be compounded by the fact that its primary user, Docker, currently
makes rather weak use of the user namespace.

The other thing is the use case: Most immutable infrastructure
container systems create the overlays in the host and then bind them
into the container.  This binding is an additional mount operation.
Now they could mount from an overlay as an overlay but it's adding
complexity because the container itself cannot control the overlay
(it's a host provided thing) so it is definitely cleaner to make the
second mount a different filesystem (i.e. shiftfs) where the nature of
the overlay is hidden from the container.

> Overlayfs already stores the mounter's credentials and uses them to
> perform most of the operations on upper.

OK, that's case 2. again.  So I think you may be labouring under the
misapprehension that shiftfs and overlayfs do the same thing with
override credentials?  They don't: overlayfs emulates the permission
lookups and then overrides based on *historical* admin credentials to
force what it's already decided on the underlying filesystems.  Shiftfs
overrides the *current* credentials with a uid/gid and namespace shift
and then runs the permission checks.  Thus if I wanted to add what
shiftfs does to overlayfs, I'd have to add another load of overriding
based on current credentials in the currently unoverridden emulated
permission checks.
I think you can see that simply running the real permission checks on
the underlying filesystem with overridden credentials is much simpler.

> I know it wasn't the original purpose of overlayfs to run as a single
> layer, but there is nothing really preventing from doing that. In
> fact, I am doing just that with my snapshot mount patches, see:
> https://github.com/amir73il/linux/commit/acc6c25eab03c176c9ef736544fab3fba663765d#diff-2b85a3c5bea4263d08a2bdff639192c3
> I registered a new fs type ("snapshot"), which reuses most of the
> existing overlayfs operations. With this patch it is possible to
> mount an overlay with only upper layer, so all the operations are
> pass through except for the credentials, e.g.:
>
> mount -t snapshot -o upper=<origin> shiftfs_test <mark location>

OK, so since you don't need to special case the permission checks, I
can see why this might work for you because you don't need to modify
overlayfs to do this.  Since I can't consume the overlay code as is, it
doesn't work for me because I'd have to add lots of special case code
to it.

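To make the "shift the current credentials, then run the real check"
model concrete, here is a minimal userspace sketch.  It is a toy under
stated assumptions and not the patch's code: struct toy_uid_extent,
struct toy_inode and the toy_*() helpers are hypothetical names, the
map has a single extent, and the permission check only looks at owner
and other mode bits (no groups, capabilities or ACLs).

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_INVALID_ID UINT32_MAX

/* One extent of a uid map, in the spirit of /proc/<pid>/uid_map. */
struct toy_uid_extent {
	uint32_t inside;	/* first id as seen inside the namespace */
	uint32_t outside;	/* first id as seen by the host */
	uint32_t count;		/* number of ids covered */
};

/* Just enough of an inode for an owner/other permission check. */
struct toy_inode {
	uint32_t uid;
	unsigned int mode;	/* e.g. 0700 */
};

/* Translate a host-side id to the namespace's interior id. */
static uint32_t toy_interior_id(const struct toy_uid_extent *e, uint32_t id)
{
	if (id < e->outside || id - e->outside >= e->count)
		return TOY_INVALID_ID;
	return e->inside + (id - e->outside);
}

/* Stand-in for the lower filesystem's own ("real") permission check. */
static bool toy_may_access(const struct toy_inode *ino, uint32_t fsuid,
			   unsigned int want)
{
	unsigned int bits = (fsuid == ino->uid) ? (ino->mode >> 6) : ino->mode;

	return (bits & 7 & want) == want;
}

/*
 * Shifted access: override the caller's current id with the interior
 * id and hand the decision to the lower check unchanged.
 */
static bool toy_shifted_access(const struct toy_uid_extent *e,
			       const struct toy_inode *lower,
			       uint32_t caller_id, unsigned int want)
{
	uint32_t shifted = toy_interior_id(e, caller_id);

	if (shifted == TOY_INVALID_ID)
		return false;
	return toy_may_access(lower, shifted, want);
}
```

With a map of inside 0, outside 100000, count 65536 and a lower inode
owned by uid 0 with mode 0700, a caller with host id 100000 is refused
by the plain check but allowed once shifted.  Nothing in the lower
check had to be emulated or special cased, which is the simplification
being argued for.
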
James

> If you think this concept is workable, then the functionality of
> mounting overlayfs with only upper should be integrated into plain
> overlayfs and shiftfs could be a very thin variant of overlayfs mount
> using shiftfs_fs_type, just for the sake of having FS_USERNS_MOUNT,
> e.g:
>
> + /*
> +  * XXX: reusing ovl_mount()/ovl_fill_super(), but could also just reuse
> +  * ovl_dentry_operations/ovl_super_operations/ovl_xattr_handlers/ovl_new_inode()
> +  */
> +static struct file_system_type shiftfs_type = {
> +       .owner = THIS_MODULE,
> +       .name = "shiftfs",
> +       .mount = ovl_mount,
> +       .kill_sb = kill_anon_super,
> +       .fs_flags = FS_USERNS_MOUNT,
> +};
> +MODULE_ALIAS_FS("shiftfs");
> +MODULE_ALIAS("shiftfs");
> +#define IS_SHIFTFS_SB(sb) ((sb)->s_type == &shiftfs_type)
>
> And instead of verifying that shiftfs is mounted inside container
> over shiftfs, verify that it is mounted over an overlayfs noexec
> mount e.g.:
>
> +       if (IS_SHIFTFS_SB(sb)) {
> +               /*
> +                * this leg executes if we're admin capable in
> +                * the namespace, so be very careful
> +                */
> +               if (path.dentry->d_sb->s_magic != OVERLAYFS_MAGIC || !(path.dentry->d_sb->s_iflags & SB_I_NOEXEC))
> +                       goto out_put;
>
> From users manual POV:
>
> in host:
> mount -t overlay -o noexec,upper=<origin> container_visible <mark location>
>
> in container:
> mount -t shiftfs -o upper=<mark location> container_writable <somewhere in my local mount ns>
>
> Thought?