On Fri, Jun 15, 2018 at 08:56:38AM -0500, Serge E. Hallyn wrote:
> Quoting Seth Forshee (seth.forshee@xxxxxxxxxxxxx):
> > I wanted to inquire about the current status of shiftfs and the plans
> > for it moving forward. We'd like to have this functionality available
> > for use in lxd, and I'm interested in helping with development (or
> > picking up development if it's stalled).
> >
> > To start, is anyone still working on shiftfs or similar functionality? I
> > haven't found it in any git tree on kernel.org, and as far as mailing
> > list activity the last submission I can find is [1]. Is there anything
> > newer than this?
> >
> > Based on past mailing list discussions, it seems like there was still
> > debate as to whether this feature should be an overlay filesystem or
> > something supported at the vfs level. Was this ever resolved?
> >
> > Thanks,
> > Seth
> >
> > [1] http://lkml.kernel.org/r/1487638025.2337.49.camel@xxxxxxxxxxxxxxxxxxxxx
>
> Hey Seth,
>
> I haven't heard anything in a long time. But if this is going to pick
> back up, can we come up with a detailed set of goals and requirements?

I was planning to follow up later with some discussion of requirements.
Here are some of ours:

 - Supports any id maps possible for a user namespace
 - Does not break inotify
 - Passes accurate disk usage and source information from the "underlay"
 - Works with a variety of filesystems (ext4, xfs, btrfs, etc.)
 - Works with nested containers

I'm also interested in collecting any requirements others might have.

> I don't recall whether the last version still worked like this, but I'm
> still not comfortable with the idea of a system where after a reboot,
> container-created root-owned files are owned by host root until a path
> is specially marked. Enforcing that the "source" directory is itself
> uid-shifted would greatly ease my mind.

I understand the concern and share the discomfort to some degree, but
I'm not convinced that requiring the source subtree be shifted is the
right approach.

First, let's address the marking question. As you stated, an approach
that leaves the subtree unmarked for a period of time is problematic,
and imo this is a fatal flaw with marking as a protection against e.g.
execing some suid root file written by a container. Writing some such
mark to the filesystem would make it persistent, but it could also
restrict support to a limited set of filesystems.

However, I do think it's necessary for a user with sufficient
capabilities to "bless" a subtree for mounting in a less privileged
context, so this is a feature of marking that I would like to keep. I
think the new mount apis in David Howells' filesystem context patches
[1] might give us a nicer way to do this. For example, root in
init_user_ns could set up a mount fd which specifies the source subtree
for the id shift. At that time the kernel could check for
ns_capable(sb->s_user_ns, CAP_SYS_ADMIN) for the filesystem containing
the source subtree. Then the fd could be passed to a container in a user
namespace, which could use it to attach the mount to its filesystem
tree. The same concept could be extended to nested containers, as long
as the user setting the source subtree has CAP_SYS_ADMIN towards
sb->s_user_ns for the subtree.

Now back to requiring the source subtree be id shifted. I understand the
motivation for wanting this, but I'm not sure I'm in favor of it.

To start, there are other ways to ensure that id shifted mounts don't
lead to problems, such as putting the subtree under a directory
accessible only by root or putting it in a nosuid or noexec mount. For
some implementations those sorts of protections are going to make sense.

Having this requirement may also add significant time to mounting, as I
assume it would involve iterating through all filesystem objects.

Additionally, that requirement is likely to significantly complicate the
implementation. The simplest implementation would just translate the
k[ug]ids in the inodes to a target user ns. A slightly more complicated
approach might translate them based on a source and destination user ns.
If it's implemented based on passing in an arbitrary id map at mount
time it will be more complex and duplicate functionality that user
namespaces already give us.

Thanks,
Seth

[1] http://lkml.kernel.org/r/152720672288.9073.9868393448836301272.stgit@xxxxxxxxxxxxxxxxxxxxxx
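
To make the two translation schemes mentioned above a bit more concrete
(translating inode ids to a target user ns, versus translating between a
source and a destination user ns), here is a rough userspace model in C.
It is only a sketch of one possible reading, not kernel code and not
taken from any shiftfs patch: struct id_extent, struct id_map, map_down,
map_up, shift_to_target and shift_between are all hypothetical names,
and the extents are modelled on the lines of /proc/<pid>/uid_map.

```c
#include <stddef.h>
#include <stdint.h>

#define INVALID_ID ((uint32_t)-1)

/* One extent of an id map, in the style of a /proc/<pid>/uid_map line:
 * ids [first, first + count) inside the namespace correspond to
 * ids [lower_first, lower_first + count) outside it. */
struct id_extent {
	uint32_t first;
	uint32_t lower_first;
	uint32_t count;
};

/* Hypothetical stand-in for the id map of a user namespace. */
struct id_map {
	const struct id_extent *extents;
	size_t nr_extents;
};

/* Map an id as seen inside the namespace to the id outside it. */
uint32_t map_down(const struct id_map *map, uint32_t id)
{
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct id_extent *e = &map->extents[i];

		if (id >= e->first && id - e->first < e->count)
			return e->lower_first + (id - e->first);
	}
	return INVALID_ID;
}

/* Map an id outside the namespace to the id as seen inside it. */
uint32_t map_up(const struct id_map *map, uint32_t outer)
{
	for (size_t i = 0; i < map->nr_extents; i++) {
		const struct id_extent *e = &map->extents[i];

		if (outer >= e->lower_first && outer - e->lower_first < e->count)
			return e->first + (outer - e->lower_first);
	}
	return INVALID_ID;
}

/* Scheme 1 (assumed reading): interpret the id stored in the inode as
 * an id inside the target namespace. */
uint32_t shift_to_target(const struct id_map *target, uint32_t inode_id)
{
	return map_down(target, inode_id);
}

/* Scheme 2 (assumed reading): view the inode id through the source
 * namespace first, then map that view into the destination namespace. */
uint32_t shift_between(const struct id_map *src, const struct id_map *dst,
		       uint32_t inode_id)
{
	uint32_t id = map_up(src, inode_id);

	if (id == INVALID_ID)
		return INVALID_ID;
	return map_down(dst, id);
}
```

With a container whose map is "0 100000 65536", shift_to_target() would
present a file stored as uid 1000 as uid 101000 on the host side, and
any id outside the mapped range comes back as INVALID_ID. Both schemes
reuse maps that user namespaces already carry, which is why an arbitrary
id map passed at mount time would duplicate existing functionality.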