Re: shiftfs status and future development

Seth Forshee <seth.forshee@xxxxxxxxxxxxx> · Fri, 15 Jun 2018 09:59:17 -0500

On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote:
> Quoting Seth Forshee (seth.forshee@xxxxxxxxxxxxx):
> > I wanted to inquire about the current status of shiftfs and the plans
> > for it moving forward. We'd like to have this functionality available
> > for use in lxd, and I'm interested in helping with development (or
> > picking up development if it's stalled).
> > 
> > To start, is anyone still working on shiftfs or similar functionality? I
> > haven't found it in any git tree on kernel.org, and as far as mailing
> > list activity the last submission I can find is [1]. Is there anything
> > newer than this?
> > 
> > Based on past mailing list discussions, it seems like there was still
> > debate as to whether this feature should be an overlay filesystem or
> > something supported at the vfs level. Was this ever resolved?
> > 
> > Thanks,
> > Seth
> > 
> > [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@xxxxxxxxxxxxxxxxxxxxx
> 
> Hey Seth,
> 
> I haven't heard anything in a long time.  But if this is going to pick
> back up, can we come up with a detailed set of goals and requirements?

I was planning to follow up later with some discussion of requirements.
Here are some of ours:

 - Supports any id maps possible for a user namespace

 - Does not break inotify

 - Passes accurate disk usage and source information from the "underlay"

 - Works with a variety of filesystems (ext4, xfs, btrfs, etc.)

 - Works with nested containers

I'm also interested in collecting any requirements others might have.

> I don't recall whether the last version still worked like this, but I'm
> still not comfortable with the idea of a system where after a reboot,
> container-created root-owned files are owned by host root until a path
> is specially marked.  Enforcing that the "source" directory is itself
> uid-shifted would greatly ease my mind.

I understand the concern and share the discomfort to some degree, but
I'm not convinced that requiring the source subtree be shifted is the
right approach.

First, let's address the marking question. As you stated, an approach
that leaves the subtree unmarked for a period of time is problematic, and
imo this is a fatal flaw with marking as a protection against e.g. execing
some suid root file written by a container. Writing such a mark to the
filesystem would make it persistent, but it could also restrict support
to a limited set of filesystems.

However, I do think it's necessary for a user with sufficient
capabilities to "bless" a subtree for mounting in a less privileged
context, so this is a feature of marking that I would like to keep. I
think the new mount apis in David Howells' filesystem context patches [1]
might give us a nicer way to do this. For example, root in init_user_ns
could set up a mount fd which specifies the source subtree for the id
shift. At that time the kernel could check for ns_capable(sb->s_user_ns,
CAP_SYS_ADMIN) for the filesystem containing the source subtree. Then
the fd could be passed to a container in a user namespace, which could use
it to attach the mount to its filesystem tree.  The same concept could
be extended to nested containers, as long as the user setting the source
subtree has CAP_SYS_ADMIN towards sb->s_user_ns for the subtree.
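
To make the flow above concrete, here is a rough userspace model in
Python. It is only a sketch: UserNS, Cred, MountFd, prepare_shift_mount
and attach are hypothetical names invented for illustration, and
ns_capable here is a simplified imitation of the kernel check, not the
kernel code or the API from the filesystem context patches.

```python
CAP_SYS_ADMIN = "CAP_SYS_ADMIN"


class UserNS:
    """Toy user namespace: parent link, nesting level, owning uid."""

    def __init__(self, name, parent=None, owner=0):
        self.name = name
        self.parent = parent
        self.owner = owner  # uid (in kernel terms) that created this ns
        self.level = 0 if parent is None else parent.level + 1


class Cred:
    """Toy credentials: the caller's user ns, euid and effective caps."""

    def __init__(self, user_ns, euid, effective=()):
        self.user_ns = user_ns
        self.euid = euid
        self.effective = frozenset(effective)


def ns_capable(cred, target_ns, cap):
    """Simplified model: does cred hold cap towards target_ns?"""
    ns = target_ns
    while True:
        if ns is cred.user_ns:
            return cap in cred.effective
        if ns.level <= cred.user_ns.level:
            # target_ns is not a descendant of the caller's namespace
            return False
        if ns.parent is cred.user_ns and ns.owner == cred.euid:
            # the owner of a namespace holds all caps towards it
            return True
        ns = ns.parent


class MountFd:
    """Hypothetical handle for a blessed, not yet attached shift mount."""

    def __init__(self, source, sb_user_ns):
        self.source = source
        self.sb_user_ns = sb_user_ns
        self.mountpoint = None


def prepare_shift_mount(cred, sb_user_ns, source):
    """Privileged step: bless a source subtree and return a mount fd."""
    if not ns_capable(cred, sb_user_ns, CAP_SYS_ADMIN):
        raise PermissionError("need CAP_SYS_ADMIN towards sb->s_user_ns")
    return MountFd(source, sb_user_ns)


def attach(fd, mountpoint):
    """Unprivileged step: the fd holder attaches the mount to its tree."""
    fd.mountpoint = mountpoint
    return fd
```

In this model, root in init_user_ns can call prepare_shift_mount() for a
filesystem whose s_user_ns is init_user_ns, while root inside a container
cannot; the container only ever calls attach() on an fd it was handed.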

Now back to requiring the source subtree be id shifted. I understand the
motivation for wanting this, but I'm not sure I'm in favor of it. To
start, there are other ways to ensure that id shifted mounts don't lead
to problems, such as putting the subtree under a directory accessible
only by root or putting it in a nosuid or noexec mount. For some
implementations those sorts of protections are going to make sense.

Having this requirement may also add significant time to mounting, as I
assume it would involve iterating through all filesystem objects.
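
As a rough illustration of that cost, a userspace check that a subtree is
already shifted has to lstat every object in it. The sketch below assumes
"shifted" means every owner uid falls inside one contiguous range; the
function name and that definition are assumptions made for the example
(gids would be checked the same way and are left out for brevity).

```python
import os


def find_unshifted(root, first_id, count):
    """Return paths under root whose owner uid is outside the range
    [first_id, first_id + count). Visits every object once."""
    offenders = []

    def check(path):
        st = os.lstat(path)  # lstat: do not follow symlinks
        if not (first_id <= st.st_uid < first_id + count):
            offenders.append(path)

    check(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            check(os.path.join(dirpath, name))
    return offenders
```

The work grows linearly with the number of objects in the subtree, which
is the mount-time cost in question.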

Additionally, that requirement is likely to significantly complicate the
implementation. The simplest implementation would just translate the
k[ug]ids in the inodes to a target user ns. A slightly more complicated
approach might translate them based on a source and destination user ns.
If it's implemented based on passing in an arbitrary id map at mount
time it will be more complex and duplicate functionality that user
namespaces already give us.
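
The first two options can be modelled in a few lines of Python. This is a
sketch under assumptions: the extents mirror the three columns of
/proc/PID/uid_map, shift_kid is a hypothetical helper, and the example
maps are made up. It shows one reading of "translate based on a source
and destination user ns", not an actual shiftfs implementation.

```python
from typing import NamedTuple, Optional, Sequence


class Extent(NamedTuple):
    first: int        # first id as seen inside the namespace
    lower_first: int  # first corresponding kernel-global id (kid)
    count: int        # number of consecutive ids covered


def map_id_down(extents: Sequence[Extent], ns_id: int) -> Optional[int]:
    """Namespace-relative id -> kid, or None if ns_id is unmapped."""
    for e in extents:
        if e.first <= ns_id < e.first + e.count:
            return e.lower_first + (ns_id - e.first)
    return None


def map_id_up(extents: Sequence[Extent], kid: int) -> Optional[int]:
    """kid -> namespace-relative id, or None if kid is unmapped."""
    for e in extents:
        if e.lower_first <= kid < e.lower_first + e.count:
            return e.first + (kid - e.lower_first)
    return None


def shift_kid(kid: int, source_map: Sequence[Extent],
              target_map: Sequence[Extent]) -> Optional[int]:
    """View the inode's kid as an id in the source ns, then map that
    same numeric id down through the target ns."""
    ns_id = map_id_up(source_map, kid)
    if ns_id is None:
        return None
    return map_id_down(target_map, ns_id)


# Identity map, standing in for init_user_ns (example value).
INIT_MAP = [Extent(0, 0, 2**32 - 1)]
# Example container map: ids 0..65535 backed by kids 100000..165535.
CONTAINER_MAP = [Extent(0, 100000, 65536)]
```

With INIT_MAP as the source this is the simplest case: a file owned by kid
0 on disk is presented as kid 100000, which is uid 0 inside the container.
Passing a container's map as the source and a nested container's map as
the target gives the source/destination variant, and both reuse maps that
the user namespaces already carry.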

Thanks,
Seth

[1] http://lkml.kernel.org/r/152720672288.9073.9868393448836301272.stgit@xxxxxxxxxxxxxxxxxxxxxx