Quoting Stéphane Graber (stgraber@xxxxxxxxxx):
> On Tue, Jul 03, 2018 at 11:54:50AM -0500, Serge E. Hallyn wrote:
> > Quoting James Bottomley (James.Bottomley@xxxxxxxxxxxxxxxxxxxxx):
> > > On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote:
> > > > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> > > > <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > > > [...]
> > > > > > > > - Does not break inotify
> > > > > > >
> > > > > > > I don't expect it does, but I haven't checked.
> > > > > >
> > > > > > I haven't checked either; I'm planning to do so soon. This is a
> > > > > > concern that was expressed to me by others, I think because
> > > > > > inotify doesn't work with overlayfs.
> > > > >
> > > > > I think shiftfs does work simply because it doesn't really do
> > > > > overlays, so lots of stuff that doesn't work with overlays does
> > > > > work with it.
> > > >
> > > > I'm afraid shiftfs suffers from the same problems that the old naive
> > > > overlayfs inode implementation suffered from.
> > > >
> > > > This problem is demonstrated by the LTP tests inotify08 and inotify09.
> > > > shiftfs_new_inode() is called on every lookup, so an inotify watch
> > > > may be set on an inode object, the dentry may then be evicted from
> > > > the cache, and then all events on the new dentry are not reported on
> > > > the watched inode. You will need to implement hashed inodes to solve
> > > > this. It can be done as overlayfs does it - hashing by real inode
> > > > pointer.
> > > >
> > > > This is just one of those subtle things about stacked filesystems,
> > > > and there may be others at present and more in the future - if we
> > > > don't have a shared code base for the two stacked filesystems, I
> > > > wager you are going to end up "cherry picking" fixes often.
> > > >
> > > > IMO, an important question to ask is: since both shiftfs and
> > > > overlayfs are strongly coupled with container use cases, are there
> > > > users that are interested in both layering AND shifting? On the same
> > > > "mark"?
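[The watch-loss problem described above can be sketched in userspace terms. This is a hypothetical Python analogy, not kernel code: `NaiveLookup` stands in for calling shiftfs_new_inode() on every lookup, `HashedLookup` for hashing wrapper inodes by the real inode's identity, the way overlayfs does by real inode pointer. All class and method names here are illustrative.]

```python
class RealInode:
    """Stands in for the underlying filesystem's inode."""

class ShiftedInode:
    """Stands in for the stacked filesystem's wrapper inode."""
    def __init__(self, real):
        self.real = real
        self.watches = []   # an inotify watch would hang off this object

class NaiveLookup:
    """A fresh wrapper on every lookup - like shiftfs today."""
    def lookup(self, real):
        return ShiftedInode(real)

class HashedLookup:
    """Wrappers hashed by the real inode's identity - like overlayfs."""
    def __init__(self):
        self.cache = {}
    def lookup(self, real):
        return self.cache.setdefault(id(real), ShiftedInode(real))

real = RealInode()

naive = NaiveLookup()
a, b = naive.lookup(real), naive.lookup(real)
assert a is not b    # a watch set on `a` is lost after dentry eviction

hashed = HashedLookup()
c, d = hashed.lookup(real), hashed.lookup(real)
assert c is d        # re-lookup returns the same inode; the watch survives
```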
> > > > If the answer is yes, then this may be an argument in favor
> > > > of integrating at least some of shiftfs's functionality into
> > > > overlayfs.
> > >
> > > My container use case is interested in shifting but not layering. Even
> > > the docker use case would only mix the two with the overlay graph
> > > driver. There seem to be quite a few clouds using non-overlayfs graph
> > > drivers (the dm one being the most popular).
> > >
> > > > Another argument is that shiftfs itself takes the maximum allowed
> > > > 2 levels of s_stack_depth for its 2 mounts, so it is actually not
> > > > possible with the current VFS limitation to combine shiftfs with
> > > > overlayfs.
> > >
> > > That's an artificial, not an inherent, restriction that was introduced
> > > to keep the call stack small. It can be increased or even eliminated
> > > (although then we'd risk a real run off the end of the kernel stack
> > > problem).
> > >
> > > > This could be solved relatively easily by adding "-o mark" support
> > > > to overlayfs and allowing shiftfs to also be mounted over a "marked"
> > > > overlayfs inside a container.
> > >
> > > Can we please decide whether the temporary mark, as implemented in the
> > > current patch set, or a more permanent security.<something> xattr type
> > > mark is preferred for this? It's an important question that's been
> > > asked, but we have no resolution on it.
> >
> > I think something permanent is mandatory. Otherwise users may be able
> > to induce a reboot into a state where the temp mark isn't made. A
> > security.<something> xattr has the problem that an older kernel may not
> > know about it.
> >
> > Two possibilities which have been mentioned before:
> >
> > 1. Just demand that the *source* idmap doesn't start at 0. Ideally it
> > would be something like 100k uids starting at 100k. The kernel would
> > refuse to do a shiftfs mount if the source idmap includes uid 0.
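[The check in option 1 amounts to rejecting any source map whose range covers uid 0. A minimal sketch of that rule, with illustrative names - `IdMapping` and `validate_source_map()` are not kernel API:]

```python
from typing import NamedTuple

class IdMapping(NamedTuple):
    first: int   # first id in the source range
    count: int   # number of ids mapped

def validate_source_map(mappings):
    """Refuse the (hypothetical) shiftfs mount if uid 0 is in the source map."""
    for m in mappings:
        if m.first <= 0 < m.first + m.count:
            raise ValueError("source idmap includes uid 0; refusing mount")

# The recommended layout - ~100k uids starting at 100k - passes:
validate_source_map([IdMapping(first=100000, count=100000)])

# A source map rooted at uid 0 would be rejected:
try:
    validate_source_map([IdMapping(first=0, count=100000)])
except ValueError:
    pass   # expected
```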
> >
> > I suppose the "let-them-shoot-themselves-in-the-foot" approach would be
> > to just strongly recommend using such a source uid mapping, but not
> > enforce it.
> >
> > 2. Enforce that the base directory have perms 700 for a shiftfs mount
> > to be allowed. So /var/lib/shiftfs/base-rootfs might be root-owned with
> > 700 perms. Root can then shiftfs-mount it to /container1, uid-shifted
> > to 100000:0:100000. Yes, root could stupidly change the perms later,
> > but failing that, uid 1000 can never exploit a setuid-root script under
> > /var/lib/shiftfs/base-rootfs.
> >
> > -serge
>
> I'm personally in favor of this second approach, and it's what I was
> hoping to use for LXD in the future. It's trivial for us to apply that
> to the parent directory of the container (so that the rootfs itself can
> have a different mode), and it's something we're already doing today for
> privileged containers anyway (as those have the exact same issue with
> setuid binaries).
>
> Having the data on the host shifted to a map which is never used by any
> user on the system would make mistakes less likely to happen, but it
> would also require you to know ahead of time how many uids/gids you'll
> need for all future containers that will use that filesystem. 100k for
> example would effectively prevent the use of many network authentication
> systems inside the container, as those tend to map to the 200k+ uid
> range.

Well, maybe not if it's overlay-based. In that case we could do something
where the base fs might only have 10k uids mapped, but the overlay has
300k. But that gets a bit funky :)
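[For the record, the arithmetic behind Stéphane's sizing concern, reading "100000:0:100000" as host-first:container-first:count (an assumption about the notation above, not something the thread defines). A container id maps to a host id only if it falls inside the 100k window, which is why ids in the 200k+ range used by some network authentication systems would not fit:]

```python
def container_to_host(cuid, host_first=100000, cont_first=0, count=100000):
    """Translate a container uid to a host uid under one shift range.

    Returns None when the uid is outside the mapped window (it would
    then appear as the overflow/nobody id rather than a real host uid).
    """
    if not (cont_first <= cuid < cont_first + count):
        return None
    return host_first + (cuid - cont_first)

assert container_to_host(0) == 100000       # container root
assert container_to_host(1000) == 101000    # an ordinary container user
assert container_to_host(250000) is None    # 200k+ range: unmapped
```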