Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Fri, 17 Feb 2017 14:57:49 +1300

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> writes:

> On Thu, 2017-02-16 at 11:42 -0500, Vivek Goyal wrote:
>> On Thu, Feb 16, 2017 at 07:51:58AM -0800, James Bottomley wrote:
>> 
>> [..]
>> > > Two levels of checks will simplify this a bit. Top level inode 
>> > > will belong to the user namespace of caller and checks should 
>> > > pass. And mounter's creds will have ownership over the real inode 
>> > > so no additional namespace shifting required there.
>> > 
>> > That's the problem: for a marked mount, they don't.
>> 
>> In this new model it does not fit directly. 
>> 
>> I was playing with a slightly different approach and modified patches 
>> so that real root still does the mounting and takes an mount option
>> which specifies which user namespace we want to shift into. Thanks to 
>> Eric for the idea.
>> 
>> mount -t shiftfs -o userns_fd=<fd> source shifted-fs

> This is a non-starter because it doesn't work for the unprivileged use
> case, which is what I'm really interested in.

But I believe it does.  It just requires a bit more work for in the
shiftfs filesystem above.  It should be perfectly possible with the help
of newuidmap to create a user namespace with the desired mappings.

My understanding is that Vivek started with requiring root to mount the
filesystem only as a simplification during development, and that the
plan is to get the basic use case working and then allow unprivileged
mounting.

> For fully unprivileged
> containers you don't have an orchestration system to ask to build the
> container.  You can get init scripts to set stuff up for you, like the
> marks, but ideally it should just work even without that (so an inode
> flag following project semantics seems really appealing), but after
> that the unprivileged user should be able to build their own
> containers.
>
> As you saw from the reply to Eric, this approach (which I have tried)
> also opens up a whole can of worms for non-FS_USERNS_MOUNT filesystems.
>

>From what I can see we have two cases we care about.
A) A non-default mapping from the filesystem to the rest of the system
   and roughly s_user_ns provides that but we need a review of the
   filesystems to make certain something has not been forgotten.

B) A filesystem image sitting around in a directory somewhere that
   we want to map differently into different user namespaces while
   using the same files as backing store.

For the second case what is interesting technically is that we want
multiple mappings.  A user namespace appears adequate to specify those
extra mappings (effectively from kuids to kuids).

So we need something to associate the additional mapping with a
directory tree.  A stackable filesystem with it's own s_user_ns field
appears a very straight forward way to do that.  Especially if it can
figure out how to assert that the underlying filesystem image is
read-only (doesn't overlayfs require that?).  Making the entire stack
read-only.

I don't see a problem with that for unprivileged use (except possibly
the read-only enforcement on the unerlying directory tree).

What Vivek is talking about appears to be perfectly correct.  Performing
the underlying filesystem permission checks as the possibly unprivileged
user who mounted shiftfs.  After performing a set of permission checks
(at the shiftfs level) as the user who is accessing the files.

. . .

I think I am missing something but I completely do not understand that
subthread that says use file marks and perform the work in the vfs.
The problem is that fundamentally we need multiple mappings and I don't
see a mark on a file (even an inherited mark) providing the mapping so I
don't see the point.

Which leaves two possible places to store the extra mapping.  In the
struct mount.  Or in a stacked filesystem super_block.  For a stacked
filesystem I can see where to store the extra translation.  In the upper
filesystems upper inode.   And we can perform the practical permission
check on the upper inode as well.

For a vfs level solution it looks like we would have to change all of
the permission checking code in the kernel to have a special case for
this kind of mapping.  Which does not sound maintainable.

So at the moment I don't think a vfs level solution makes any sense.

And then if you have a stacked filesystem with FS_USERNS_MOUNT set so it
can be mounted by an unprivileged user.   I think it makes sense to
check the mounters creds agains the real inode.  To verify the user that
mounted the filesystem has the permission to perform the desired access.

Which makes only allows the mounter as much permisison as the mounter
would have if they performed the work with fuse instead of a special
in-kernel filesystem.

In a DAC model of the world that makes lots of sense.  I don't know what
actually makes sense in a MAC world.  But I am certain that is something
that can be worked through.

Eric