On Mon, 2017-02-06 at 08:59 +0200, Amir Goldstein wrote:
> On Mon, Feb 6, 2017 at 3:18 AM, James Bottomley
> <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > On Sun, 2017-02-05 at 09:51 +0200, Amir Goldstein wrote:
> > > On Sat, Feb 4, 2017 at 9:19 PM, James Bottomley
> > > <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > > > This allows any subtree to be uid/gid shifted and bound
> > > > elsewhere.  It does this by operating similarly to overlayfs.
> > > > Its primary use is for shifting the underlying uids of
> > > > filesystems used to support unprivileged (uid shifted)
> > > > containers.  The usual use case here is that the container is
> > > > operating with a uid shifted unprivileged root but sometimes
> > > > needs to make use of or work with a filesystem image that has
> > > > root at real uid 0.
> > > >
> > > > The mechanism is to allow any subordinate mount namespace to
> > > > mount a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but
> > > > only allowing it to mount marked subtrees (using the -o mark
> > > > option as root).  Once mounted, the subtree is mapped via the
> > > > super block user namespace so that the interior ids of the
> > > > mounting user namespace are the ids written to the filesystem.
> > > >
> > > > Signed-off-by: James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
> > > >
> > > James,
> > >
> > > Allow me to point out some problems in this patch and offer a
> > > slightly different approach.
> > >
> > > First of all, the subject says "uid/gid shifting bind mount", but
> > > it's not really a bind mount.  What it is is a stackable mount,
> > > and 2 levels of stack no less.
> >
> > The reason for the description is to have it behave exactly like a
> > bind mount.  You can assert that a bind mount is, in fact, a
> > stacked mount, but we don't currently.  I'm also not sure where
> > you get your 2 levels from?
> >
>
> A bind mount does not incur recursion into VFS code; a stacked fs
> does.  And there is a programmable limit on stack depth of 2, which
> stacked filesystems need to comply with.  Your proposed setup has 2
> stacked fs: the mark shiftfs by the admin and the uid shiftfs by the
> container user.  Or maybe I misunderstood.

Oh, right, actually, it wouldn't be 2 because once the unprivileged
mount uses the marked filesystem, what it uses is the mnt and dentry
from the underlying filesystem (what you would have got from a path
lookup on it).  That said, it does perform recursive calls to the
underlying filesystem unlike a true bind mount, so I can add the
depth accounting easily enough.

> > > So one thing that is missing is increasing of sb->s_stack_depth,
> > > and that also means that shiftfs cannot be used to recursively
> > > shift uids in a child userns, if that was ever the intention.
> >
> > I can't think of a use case that would ever need that, but perhaps
> > other container people can.
> >
> > > The other problem is that by forking overlayfs functionality,
> >
> > So this wouldn't really be the right way to look at it: shiftfs
> > shares no code with overlayfs at all, so is definitely not a fork.
> > The only piece of functionality it has which is similar to
> > overlayfs is the way it does lookups via a new dentry cache.
> > However, that functionality is not unique to overlayfs and if you
> > look, you'll see that shiftfs_lookup() actually has far more in
> > common with ecryptfs_lookup().
>
> That's a good point.  All stackable file systems may share similar
> problems and solutions (e.g. consistent st_ino/st_dev).
> Perhaps it calls for shared library code or more generic VFS code.
> At the moment ecryptfs is not seeing much development, so everything
> happens in overlayfs.  If there is going to be more than one
> actively developed stackable fs, we need to see about that.

I believe we already do ... if you look at the lookup functions of
each of them, you see the only common thing is encapsulated in a
variant of the lookup_one_len() functions.  After that, even simple
things like our negative dentry handling differ.

> > > shiftfs is going to miss out on overlayfs bug fixes related to
> > > user credentials differing from mounter credentials, like
> > > fd3220d ("ovl: update S_ISGID when setting posix ACLs").  I am
> > > not sure that this specific case is relevant to shiftfs, but
> > > there could be others.
> >
> > OK, so shiftfs doesn't have this bug and the reason why is
> > illustrative: basically shiftfs does three things
> >
> > 1. lookups via a uid/gid shifted dentry cache
> > 2. shifted credential inode operations permission checks on the
> >    underlying filesystem
> > 3. location marking for unprivileged mount
> >
> > I think we've already seen that 1. isn't from overlayfs but the
> > functionality could be added to overlayfs, I suppose.  The big
> > problem is 2.  The overlayfs code emulates the permission checks,
> > which makes it rather complex (this is where you get your bugs
> > like the above from).  I did actually look at adding 2. to
> > overlayfs on the theory that a single layer overlay might be
> > closest to what this is, but eventually concluded I'd have to take
> > the special cases and add a whole lot more to them ... it really
> > would increase the maintenance burden substantially and make the
> > code an unreadable rat's nest.
> >
>
> The use cases for uid shifting are still overwhelming for me.
> I take your word for it that it's going to be a maintenance burden
> to add this functionality to overlayfs.
>
> > When you think about it this way, it becomes obvious that the
> > clean separation is if shiftfs functionality is layered on top of
> > overlayfs, and when you do that, doing it as its own filesystem is
> > more logical.
> >
>
> Yes, I agree with that statement.  This is in line with the solution
> I outlined at the end of my previous email, where a single layer
> overlayfs is used for the host "mark" mount, although I wonder if
> the same cannot be achieved with a bind mount?

I understand, but once I can't consume overlayfs to construct it, the
idea of trying to use it becomes a negative, not a positive.

We could achieve the same thing using bind mounts if the vfsmount
structure carried a private field, but it doesn't.  I think, given
the prevalence of this structure throughout the mount tree, that's a
deliberate decision to keep it thin.

> in host:
> mount -t overlay -o noexec,upper=<origin> container_visible <mark location>
>
> in container:
> mount -t shiftfs -o <mark location> <somewhere in my local mount ns>

So I'm not sure it's a more widespread problem: mount --bind is
usable inside an unprivileged container, which means you can bridge
filesystem subtrees even when you're only the local container admin.
The problem is mounting other filesystem types.  Marking a type safe
for mounting is done by the FS_USERNS_MOUNT flag, but it means for
things like shiftfs that you do have to restrict the source location;
for most filesystem types, though, that source will be a device, so
they will need other checking than a mount mark.

James
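
For reference, here is a minimal sketch of the two mount-time pieces the
thread keeps coming back to: the FS_USERNS_MOUNT flag that lets a
filesystem type be mounted from inside a user namespace, and the
sb->s_stack_depth accounting against FILESYSTEM_MAX_STACK_DEPTH that
James says he can add easily enough.  This is not the code from the
patch under discussion: the names (shiftfs_mount, shiftfs_fill_super)
follow the patch's naming but the bodies are simplified assumptions,
and the "-o mark" restriction, option parsing, and the rest of the
superblock setup (s_op, s_root, ...) are deliberately elided.

/*
 * Illustrative sketch only -- not the patch code.  Shows FS_USERNS_MOUNT
 * and sb->s_stack_depth handling for a stacked filesystem type.
 */
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/namei.h>

static int shiftfs_fill_super(struct super_block *sb, void *data, int silent)
{
	struct path lower;
	int err;

	/* the source path was stashed in *data by ->mount below */
	err = kern_path((char *)data, LOOKUP_FOLLOW, &lower);
	if (err)
		return err;

	/*
	 * Recursive calls go to the underlying filesystem, so account for
	 * one extra level of nesting and refuse to exceed the VFS limit --
	 * the sb->s_stack_depth point Amir raised.
	 */
	sb->s_stack_depth = lower.dentry->d_sb->s_stack_depth + 1;
	if (sb->s_stack_depth > FILESYSTEM_MAX_STACK_DEPTH) {
		path_put(&lower);
		return -EINVAL;
	}

	/*
	 * A real unprivileged (userns) mount would additionally have to be
	 * restricted to subtrees previously marked by a real-root
	 * "-o mark" mount; that check and the remaining superblock setup
	 * are omitted from this sketch.
	 */
	path_put(&lower);
	return 0;
}

static struct dentry *shiftfs_mount(struct file_system_type *fs_type,
				    int flags, const char *devname, void *data)
{
	return mount_nodev(fs_type, flags, (void *)devname, shiftfs_fill_super);
}

static struct file_system_type shiftfs_type = {
	.owner		= THIS_MODULE,
	.name		= "shiftfs",
	.mount		= shiftfs_mount,
	.kill_sb	= kill_anon_super,
	/* FS_USERNS_MOUNT is what allows a user namespace to mount this type */
	.fs_flags	= FS_USERNS_MOUNT,
};
MODULE_ALIAS_FS("shiftfs");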
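
In the same spirit, a sketch of point 2 in James's list -- doing the
permission check on the underlying inode under shifted credentials
rather than re-implementing the VFS checks the way overlayfs does.
Again this is illustrative rather than the patch code: stashing the
lower inode in i_private and the exact credential handling here are
assumptions for the example.

/*
 * Illustrative sketch only -- a ->permission op that translates the
 * caller's ids through the mount's user namespace and then lets the
 * VFS do the real check on the lower inode.
 */
#include <linux/cred.h>
#include <linux/fs.h>
#include <linux/uidgid.h>
#include <linux/user_namespace.h>

static int shiftfs_permission(struct inode *inode, int mask)
{
	struct inode *lower = inode->i_private;	/* assumed: lower inode stashed here */
	struct user_namespace *ns = inode->i_sb->s_user_ns;
	const struct cred *oldcred;
	struct cred *newcred;
	int err;

	/* credential allocation may sleep, so bail out of RCU-walk */
	if (mask & MAY_NOT_BLOCK)
		return -ECHILD;

	newcred = prepare_creds();
	if (!newcred)
		return -ENOMEM;

	/*
	 * Shift fsuid/fsgid: the interior ids of the mounting user
	 * namespace become the ids used against the lower filesystem.
	 * Real code must also handle ids that have no mapping.
	 */
	newcred->fsuid = KUIDT_INIT(from_kuid(ns, current_fsuid()));
	newcred->fsgid = KGIDT_INIT(from_kgid(ns, current_fsgid()));

	oldcred = override_creds(newcred);
	err = inode_permission(lower, mask);
	revert_creds(oldcred);
	put_cred(newcred);

	return err;
}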