Re: shiftfs status and future development

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Mon, 18 Jun 2018 07:56:59 -0700

On Mon, 2018-06-18 at 08:40 -0500, Seth Forshee wrote:
> On Fri, Jun 15, 2018 at 08:03:05PM -0700, James Bottomley wrote:
> > On Fri, 2018-06-15 at 09:59 -0500, Seth Forshee wrote:
> > > On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote:
> > > > Quoting Seth Forshee (seth.forshee@xxxxxxxxxxxxx):
> > > > > I wanted to inquire about the current status of shiftfs and
> > > > > the plans for it moving forward. We'd like to have this
> > > > > > functionality available for use in lxd, and I'm interested in
> > > > > helping with development (or picking up development if it's
> > > > > stalled).
> > > > > 
> > > > > To start, is anyone still working on shiftfs or similar
> > > > > functionality? I haven't found it in any git tree on
> > > > > kernel.org, and as far as mailing list activity the last
> > > > > submission I can find is [1]. Is there anything newer than
> > > > > this?
> > > > > 
> > > > > Based on past mailing list discussions, it seems like there
> > > > > was still debate as to whether this feature should be an
> > > > > overlay filesystem or something supported at the vfs level.
> > > > > Was this ever resolved?
> > > > > 
> > > > > Thanks,
> > > > > Seth
> > > > > 
> > > > > [1]
> > > > > > http://lkml.kernel.org/r/1487638025.2337.49.camel@HansenPartnership.com
> > > > 
> > > > Hey Seth,
> > > > 
> > > > I haven't heard anything in a long time.  But if this is going
> > > > to pick back up, can we come up with a detailed set of goals
> > > > and requirements?
> > 
> > That would actually help.
> > 
> > > I was planning to follow up later with some discussion of
> > > requirements. Here are some of ours:
> > > 
> > >  - Supports any id maps possible for a user namespace
> > 
> > Could you clarify: right at the moment, it basically reverses the
> > namespace ID mapping when it goes on to the filesystem using the
> > superblock user namespace, so, in theory you can have an arbitrary
> > mapping simply by changing the s_user_ns.  The problem here is that
> > you don't have a lot of tools for manipulating the s_user_ns.
> 
> For our purposes the way you're shifting with s_user_ns works fine. I
> know that Serge would prefer a more arbitrary shift so that an
> arbitrary, unprivileged range in the source fs could be used (e.g. use
> ids 100000 - 101000 in the source instead of 0 - 1000), and my
> thoughts on that are quoted below.

The original (v1) shiftfs did simply take a range of ids to shift as an
argument.  However, that one could only be set up by root and Eric
expressed a desire that it use the s_user_ns.
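
To make the range-style shift concrete, here is a minimal userspace
sketch of the arithmetic, assuming a single contiguous range.  It is
not the shiftfs code; struct id_extent, ID_INVALID and both function
names are hypothetical and only loosely modelled on one line of a
uid_map.

```c
#include <stdint.h>

/* Hypothetical marker for "no mapping exists for this id". */
#define ID_INVALID ((uint32_t)-1)

/*
 * Hypothetical extent: ids [inner, inner + count) seen through the
 * shifted mount correspond to ids [outer, outer + count) stored on the
 * source filesystem.
 */
struct id_extent {
	uint32_t inner;
	uint32_t outer;
	uint32_t count;
};

/* Map an id seen through the shifted mount to the id on the source fs. */
static uint32_t shift_to_source(const struct id_extent *e, uint32_t id)
{
	if (id < e->inner || id - e->inner >= e->count)
		return ID_INVALID;
	return e->outer + (id - e->inner);
}

/* Reverse direction: id on the source fs to id seen through the mount. */
static uint32_t shift_from_source(const struct id_extent *e, uint32_t id)
{
	if (id < e->outer || id - e->outer >= e->count)
		return ID_INVALID;
	return e->inner + (id - e->outer);
}
```

With an extent of { 0, 100000, 1001 }, id 0 through the mount lands on
100000 in the source and id 1000 lands on 101000; anything outside the
range is rejected rather than passed through unshifted.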

> > >  - Does not break inotify
> > 
> > I don't expect it does, but I haven't checked.
> 
> I haven't checked either; I'm planning to do so soon. This is a
> concern that was expressed to me by others, I think because inotify
> doesn't work with overlayfs.

I think shiftfs does work simply because it doesn't really do overlays,
so lots of stuff that doesn't work with overlays does work with it.

> > >  - Passes accurate disk usage and source information from the
> > > "underlay"
> > 
> > mounts of this type don't currently show up in df
> > 
> > >  - Works with a variety of filesystems (ext4, xfs, btrfs, etc.)
> > 
> > yes
> > 
> > >  - Works with nested containers
> > 
> > yes
> 
> I'd say not so much:
> 
>         /* to mark a mount point, must be real root */
>         if (ssi->mark && !capable(CAP_SYS_ADMIN))
>                 goto out;
> 
> So within a container I cannot mark a range to be shiftfs-mountable
> within a container I create. I'd argue that as long as a user has
> CAP_SYS_ADMIN towards sb->s_user_ns for the source filesystem it
> should be safe to allow this as it implies privileges wrt all ids
> found in the source mount. This will likely lead to stacked shiftfs
> mounts, not sure yet whether or not this works in the current code.

Um, I think we have different definitions of "works with nested
containers".  Recall that for a nested container the s_user_ns is also
nested, so we shift all the way back to the uid in the root.  That
means if the check for marking is not capable(CAP_SYS_ADMIN) then an
unprivileged user would be able to gain root write access by setting up
a nested shift.  If your definition of nested means we only shift back
one level of user_ns nesting then this could become ns_capable(), so I
think we need to add "what is the desired nesting behaviour?" to the
questions to be answered by the requirements.
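
The two possible meanings of "nested" can be sketched in userspace
terms as follows.  This is only an illustration under simplifying
assumptions (one contiguous extent per namespace level), not kernel
code; struct ns_level, map_up() and shift_back() are hypothetical
names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for "no mapping exists for this id". */
#define ID_INVALID ((uint32_t)-1)

/*
 * Hypothetical single-extent map for one user namespace level: ids
 * [inner, inner + count) in the namespace correspond to ids
 * [parent, parent + count) in its parent namespace.
 */
struct ns_level {
	uint32_t inner;
	uint32_t parent;
	uint32_t count;
};

/* Translate an id one level up, from a namespace to its parent. */
static uint32_t map_up(const struct ns_level *l, uint32_t id)
{
	if (id == ID_INVALID || id < l->inner || id - l->inner >= l->count)
		return ID_INVALID;
	return l->parent + (id - l->inner);
}

/*
 * levels[0] is the innermost namespace and levels[depth - 1] maps into
 * the initial namespace.  hops == depth models shifting all the way
 * back to the root; hops == 1 models shifting back one level only.
 */
static uint32_t shift_back(const struct ns_level *levels, size_t depth,
			   size_t hops, uint32_t id)
{
	size_t i;

	if (hops > depth)
		return ID_INVALID;
	for (i = 0; i < hops; i++)
		id = map_up(&levels[i], id);
	return id;
}
```

For example, with levels { 0, 1000, 1000 } and { 0, 100000, 65536 },
id 0 in the nested container becomes 101000 when shifted through both
levels, but only 1000 when shifted through one.  Which of those two
results the mount should produce is the nesting question.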

James