James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> writes:

> On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
>> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
>>
>> [..]
>> > > Two levels of checks will simplify this a bit. Top level inode
>> > > will belong to the user namespace of caller and checks should
>> > > pass. And mounter's creds will have ownership over the real inode
>> > > so no additional namespace shifting required there.
>> >
>> > That's the problem: for a marked mount, they don't.
>>
>> In this new model it does not fit directly.
>>
>> I was playing with a slightly different approach and modified patches
>> so that real root still does the mounting and takes an mount option
>> which specifies which user namespace we want to shift into. Thanks to
>> Eric for the idea.
>>
>> mount -t shiftfs -o userns_fd=<fd> source shifted-fs
>
> This is a non-starter because it doesn't work for the unprivileged use
> case, which is what I'm really interested in.

But I believe it does. It just requires a bit more work in the
shiftfs filesystem above.

It should be perfectly possible with the help of newuidmap to create a
user namespace with the desired mappings. My understanding is that
Vivek started with requiring root to mount the filesystem only as a
simplification during development, and that the plan is to get the
basic use case working and then allow unprivileged mounting.

> For fully unprivileged
> containers you don't have an orchestration system to ask to build the
> container. You can get init scripts to set stuff up for you, like the
> marks, but ideally it should just work even without that (so an inode
> flag following project semantics seems really appealing), but after
> that the unprivileged user should be able to build their own
> containers.
>
> As you saw from the reply to Eric, this approach (which I have tried)
> also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.
>

From what I can see we have two cases we care about.

A) A non-default mapping from the filesystem to the rest of the system.
   Roughly, s_user_ns provides that, but we need a review of the
   filesystems to make certain something has not been forgotten.

B) A filesystem image sitting around in a directory somewhere that we
   want to map differently into different user namespaces while using
   the same files as backing store.

For the second case what is interesting technically is that we want
multiple mappings.

A user namespace appears adequate to specify those extra mappings
(effectively from kuids to kuids).

So we need something to associate the additional mapping with a
directory tree. A stackable filesystem with its own s_user_ns field
appears a very straightforward way to do that, especially if it can
figure out how to assert that the underlying filesystem image is
read-only (doesn't overlayfs require that?), making the entire stack
read-only.

I don't see a problem with that for unprivileged use (except possibly
the read-only enforcement on the underlying directory tree).

What Vivek is talking about appears to be perfectly correct: performing
the underlying filesystem permission checks as the possibly unprivileged
user who mounted shiftfs, after performing a set of permission checks
(at the shiftfs level) as the user who is accessing the files.

. . .

I think I am missing something, but I completely do not understand the
subthread that says use file marks and perform the work in the vfs.
The problem is that fundamentally we need multiple mappings, and I
don't see a mark on a file (even an inherited mark) providing the
mapping, so I don't see the point.

Which leaves two possible places to store the extra mapping: in the
struct mount, or in a stacked filesystem super_block.

For a stacked filesystem I can see where to store the extra
translation: in the upper filesystem's upper inode. And we can
perform the practical permission check on the upper inode as well.
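
To make the two levels of checks concrete, here is a rough userspace
model in Python. It is only a sketch under assumptions, not code from
the shiftfs patches: the names Extent, Inode, shift_id, dac_allows and
shiftfs_permission are all made up for illustration, the extra mapping
is modelled as simple kuid-to-kuid extents, and the DAC check looks at
owner/other bits only (no groups, ACLs or capabilities).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical model: one extent maps `count` ids starting at
# `lower_first` (the id stored on the real inode) to ids starting at
# `upper_first` (the id shown on the upper, shifted inode).
Extent = Tuple[int, int, int]  # (lower_first, upper_first, count)


def shift_id(extents: List[Extent], lower_id: int) -> Optional[int]:
    """Translate a lower kuid to an upper kuid, or None if unmapped."""
    for lower_first, upper_first, count in extents:
        if lower_first <= lower_id < lower_first + count:
            return upper_first + (lower_id - lower_first)
    return None


@dataclass
class Inode:
    uid: int
    mode: int  # classic permission bits, e.g. 0o640


def dac_allows(uid: int, inode: Inode, want: int) -> bool:
    """Owner/other check only; groups, ACLs and capabilities are ignored."""
    bits = (inode.mode >> 6) & 7 if uid == inode.uid else inode.mode & 7
    return (bits & want) == want


def shiftfs_permission(caller_uid: int, mounter_uid: int, real: Inode,
                       extents: List[Extent], want: int) -> bool:
    upper_uid = shift_id(extents, real.uid)
    if upper_uid is None:
        return False  # owner is not representable in the extra mapping
    # The translated owner lives on the upper inode (assumed layout).
    upper = Inode(uid=upper_uid, mode=real.mode)
    # Level 1: the accessing user against the upper (shifted) inode.
    if not dac_allows(caller_uid, upper, want):
        return False
    # Level 2: the mounter's creds against the real inode.
    return dac_allows(mounter_uid, real, want)
```

In this model, a real inode owned by uid 0 with mode 0600, shifted so
that it appears owned by kuid 100000, is still refused to a caller with
kuid 100000 when the mounter is uid 1000: the first level passes, but
the mounter has no access to the real inode, so the second level fails.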
For a vfs level solution it looks like we would have to change all of
the permission checking code in the kernel to have a special case for
this kind of mapping, which does not sound maintainable.

So at the moment I don't think a vfs level solution makes any sense.

And then, if you have a stacked filesystem with FS_USERNS_MOUNT set so
it can be mounted by an unprivileged user, I think it makes sense to
check the mounter's creds against the real inode, to verify that the
user who mounted the filesystem has permission to perform the desired
access. That only allows the mounter as much permission as the mounter
would have if they performed the work with fuse instead of a special
in-kernel filesystem.

In a DAC model of the world that makes lots of sense. I don't know
what actually makes sense in a MAC world. But I am certain that is
something that can be worked through.

Eric