Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> · Wed, 8 Feb 2017 14:45:08 +0300

On 08.02.2017 09:44, Amir Goldstein wrote:
On Wed, Feb 8, 2017 at 1:42 AM, James Bottomley
<James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
On Tue, 2017-02-07 at 14:25 -0800, Christoph Hellwig wrote:
On Tue, Feb 07, 2017 at 11:01:29PM +0200, Amir Goldstein wrote:
Project IDs do not have exactly "subtree" semantics, but inheritance
semantics, which are not the same when non-empty directories get their
project ID changed.
Here is a recap:
https://lwn.net/Articles/623835/

Yes - but if we abuse them for containers we could refine the
semantics to simply not allow change of project ids from inside
containers based on say capabilities.

You mean something like this:
https://lwn.net/Articles/632917/

With the suggested protected_projects, projid 0 (also inside container)
gets a special meaning, much like user 0, so we may do interesting
things with the projid that is mapped to 0.

We can't really abuse projectid, it's part of the user namespace
mapping (for project quota).  What we can do is have a new id that
behaves like it.

Perhaps we *can* use projid without abusing it.
userns already maps projids, but there is no concept of "owning project"
for a userns, nor does it make a lot of sense, because projid is not
part of the credentials.
But if we re-brand it as "container root projid", we can try to use it
for defining semantics to grant unprivileged access to a subtree.

The functionality you are trying to get with the shiftfs mark does
sound a bit like "container root projid":
- inodes with mapped projid MAY be uid/gid shifted
- inodes with unmapped projid MAY NOT
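
To make the two rules above concrete, here is a toy Python model, not kernel code. It assumes the userns projid mapping is a list of (inside, outside, count) extents, the same triple layout that /proc/PID/projid_map uses. The helper names map_projid_down() and may_shift() are made up for illustration.

```python
# Toy model of the "container root projid" rule: an inode is eligible for
# uid/gid shifting only if its projid is mapped into the user namespace.
# Extents are hypothetical (inside_first, outside_first, count) triples.

def map_projid_down(extents, kprojid):
    """Translate an on-disk projid into the namespace, or None if unmapped."""
    for inside_first, outside_first, count in extents:
        if outside_first <= kprojid < outside_first + count:
            return inside_first + (kprojid - outside_first)
    return None


def may_shift(extents, inode_projid):
    """Mapped projid: the inode MAY be shifted. Unmapped projid: it MAY NOT."""
    return map_projid_down(extents, inode_projid) is not None
```

With the extents [(0, 1000, 1)], on-disk projid 1000 shows up as projid 0 inside the container and may_shift() accepts it, while an inode carrying projid 1001 is rejected. Nested namespaces would each carry their own extents, which is where this differs from a single-bit mark.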

I realize this may be very raw, but it's a start. If you like this
direction we can try to develop it.

But like I said, we don't really need a full ID, it would basically just
be a single bit mark to say remap or not when doing permission checks
against this inode.  It would follow some of the project id semantics
(like inheritance from parent dir)

But a single bit would only work for a single level of userns nesting, won't it?

I guess we should define the semantics for the required sub-tree
marking, before we can talk about solutions.

Good plan.

So I've been thinking about how to do this without subtree marking and
yet retain the subtree properties similar to project id.  The advantage
would be that if it can be done using only inode properties, then none
of the permission prototypes need change.  The only real subtree
property we need is ability to bind into an unprivileged mount
namespace, but we already have that.  The gotcha about marking inodes
is that they're all or nothing, so every subtree that gets access to
the inode inherits the mark.  This means that we cannot allow a user
access to a marked inode without the cover of an unprivileged user
namespace, but I think that's fixable in the permission check
(basically if the inode is marked you *only* get access if you have a
user_ns != init_user_ns and we do the permission shifts or you have
user_ns == init_user_ns and you are admin capable).
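
A rough sketch of the gate described in the parenthesis above, written as a toy Python model rather than kernel code. The types UserNS, Task and Inode and the function marked_inode_gate() are hypothetical names for illustration; "admin capable" is reduced to a boolean, and the actual uid/gid shift is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserNS:
    name: str
    is_init: bool = False   # True only for the model's init_user_ns


@dataclass(frozen=True)
class Task:
    user_ns: UserNS
    admin_capable: bool = False   # stand-in for a capability check


@dataclass(frozen=True)
class Inode:
    marked: bool = False   # the proposed per-inode "remap" mark


def marked_inode_gate(task, inode):
    """Return "unshifted", "shifted" or "denied" for this task and inode."""
    if not inode.marked:
        return "unshifted"    # unmarked inodes keep the normal checks
    if not task.user_ns.is_init:
        return "shifted"      # inside a userns: check with shifted ids
    if task.admin_capable:
        return "unshifted"    # admin in init_user_ns still gets access
    return "denied"           # unprivileged, no userns cover: no access
```

The only new outcome is the last branch: an unprivileged task in init_user_ns is refused on a marked inode regardless of the mode bits, which is what keeps the mark from leaking through other bind mounts of the same inode.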

I didn't follow, but it sounds like your proposed solution is only
good for a single level of userns nesting.
Do you think you can redefine it in terms of "container root projid"?

Looks like all this started from mangling uid/gid or some other metadata.
As usual, I have to propose funny/insane solutions:
proxy the filesystem with FUSE and mangle everything in userspace.
Or add some kind of userspace-driven remapping/mangling into overlay,
for example using a BPF script (I see it everywhere nowadays).