Quoting Stéphane Graber (stgraber@xxxxxxxxxx):
> On Tue, Jul 03, 2018 at 11:54:50AM -0500, Serge E. Hallyn wrote:
> > Quoting James Bottomley (James.Bottomley@xxxxxxxxxxxxxxxxxxxxx):
> > > On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote:
> > > > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley
> > > > <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > > > [...]
> > > > > > > > - Does not break inotify
> > > > > > >
> > > > > > > I don't expect it does, but I haven't checked.
> > > > > >
> > > > > > I haven't checked either; I'm planning to do so soon. This is a
> > > > > > concern that was expressed to me by others, I think because
> > > > > > inotify doesn't work with overlayfs.
> > > > >
> > > > > I think shiftfs does work simply because it doesn't really do
> > > > > overlays, so lots of stuff that doesn't work with overlays does
> > > > > work with it.
> > > >
> > > > I'm afraid shiftfs suffers from the same problems that the old naive
> > > > overlayfs inode implementation suffered from.
> > > >
> > > > This problem is demonstrated by the LTP tests inotify08 and inotify09.
> > > > shiftfs_new_inode() is called on every lookup, so an inotify watch
> > > > may be set on an inode object, the dentry may then be evicted from
> > > > the cache, and then all events on the new dentry are not reported on
> > > > the watched inode. You will need to implement hashed inodes to solve
> > > > this. It can be done as overlayfs does it - hashing by real inode
> > > > pointer.
> > > >
> > > > This is just one of those subtle things about stacked filesystems,
> > > > and there may be others at present and more in the future - if we
> > > > don't have a shared code base for the two stacked filesystems, I
> > > > wager you are going to end up "cherry picking" fixes often.
> > > >
> > > > IMO, an important question to ask is: since both shiftfs and
> > > > overlayfs are strongly coupled with container use cases, are there
> > > > users that are interested in both layering AND shifting? On the same
> > > > "mark"?
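[The watch-loss problem described above can be sketched in userspace terms. This is a hypothetical Python analogy, not kernel code: `NaiveLookup` stands in for calling shiftfs_new_inode() on every lookup, `HashedLookup` for hashing wrapper inodes by the real inode's identity, the way overlayfs does by real inode pointer. All class and method names here are illustrative.]

```python
class RealInode:
    """Stands in for the underlying filesystem's inode."""

class ShiftedInode:
    """Stands in for the stacked filesystem's wrapper inode."""
    def __init__(self, real):
        self.real = real
        self.watches = []   # an inotify watch would hang off this object

class NaiveLookup:
    """A fresh wrapper on every lookup - like shiftfs today."""
    def lookup(self, real):
        return ShiftedInode(real)

class HashedLookup:
    """Wrappers hashed by the real inode's identity - like overlayfs."""
    def __init__(self):
        self.cache = {}
    def lookup(self, real):
        return self.cache.setdefault(id(real), ShiftedInode(real))

real = RealInode()

naive = NaiveLookup()
a, b = naive.lookup(real), naive.lookup(real)
assert a is not b    # a watch set on `a` is lost after dentry eviction

hashed = HashedLookup()
c, d = hashed.lookup(real), hashed.lookup(real)
assert c is d        # re-lookup returns the same inode; the watch survives
```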
> > > > If the answer is yes, then this may be an argument in favor
> > > > of integrating at least some of shiftfs's functionality into
> > > > overlayfs.
> > >
> > > My container use case is interested in shifting but not layering. Even
> > > the docker use case would only mix the two with the overlay graph
> > > driver. There seem to be quite a few clouds using non-overlayfs graph
> > > drivers (the dm one being the most popular).
> > >
> > > > Another argument is that shiftfs itself takes the maximum allowed
> > > > 2 levels of s_stack_depth for its 2 mounts, so it is actually not
> > > > possible with the current VFS limitation to combine shiftfs with
> > > > overlayfs.
> > >
> > > That's an artificial, not an inherent, restriction that was introduced
> > > to keep the call stack small. It can be increased or even eliminated
> > > (although then we'd risk a real run off the end of the kernel stack
> > > problem).
> > >
> > > > This could be solved relatively easily by adding "-o mark" support
> > > > to overlayfs and allowing shiftfs to also be mounted over a "marked"
> > > > overlayfs inside a container.
> > >
> > > Can we please decide whether the temporary mark, as implemented in the
> > > current patch set, or a more permanent security.<something> xattr type
> > > mark is preferred for this? It's an important question that's been
> > > asked, but we have no resolution on it.
> >
> > I think something permanent is mandatory. Otherwise users may be able
> > to induce a reboot into a state where the temp mark isn't made. A
> > security.<something> xattr has the problem that an older kernel may not
> > know about it.
> >
> > Two possibilities which have been mentioned before:
> >
> > 1. Just demand that the *source* idmap doesn't start at 0. Ideally it
> > would be something like 100k uids starting at 100k. The kernel would
> > refuse to do a shiftfs mount if the source idmap includes uid 0.
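[The check in option 1 amounts to rejecting any source map whose range covers uid 0. A minimal sketch of that rule, with illustrative names - `IdMapping` and `validate_source_map()` are not kernel API:]

```python
from typing import NamedTuple

class IdMapping(NamedTuple):
    first: int   # first id in the source range
    count: int   # number of ids mapped

def validate_source_map(mappings):
    """Refuse the (hypothetical) shiftfs mount if uid 0 is in the source map."""
    for m in mappings:
        if m.first <= 0 < m.first + m.count:
            raise ValueError("source idmap includes uid 0; refusing mount")

# The recommended layout - ~100k uids starting at 100k - passes:
validate_source_map([IdMapping(first=100000, count=100000)])

# A source map rooted at uid 0 would be rejected:
try:
    validate_source_map([IdMapping(first=0, count=100000)])
except ValueError:
    pass   # expected
```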
> >
> > I suppose the "let-them-shoot-themselves-in-the-foot" approach would be
> > to just strongly recommend using such a source uid mapping, but not
> > enforce it.
> >
> > 2. Enforce that the base directory have perms 700 for a shiftfs mount
> > to be allowed. So /var/lib/shiftfs/base-rootfs might be root-owned with
> > 700 perms. Root can then shiftfs-mount it to /container1, uid-shifted
> > to 100000:0:100000. Yes, root could stupidly change the perms later,
> > but failing that, uid 1000 can never exploit a setuid-root script under
> > /var/lib/shiftfs/base-rootfs.
> >
> > -serge
>
> I'm personally in favor of this second approach, and it's what I was
> hoping to use for LXD in the future. It's trivial for us to apply that
> to the parent directory of the container (so that the rootfs itself can
> have a different mode), and it's something we're already doing today for
> privileged containers anyway (as those have the exact same issue with
> setuid binaries).
>
> Having the data on the host shifted to a map which is never used by any
> user on the system would make mistakes less likely to happen, but it
> would also require you to know ahead of time how many uids/gids you'll
> need for all future containers that will use that filesystem. 100k for
> example would effectively prevent the use of many network authentication
> systems inside the container, as those tend to map to the 200k+ uid
> range.

Well, maybe not if it's overlay-based. In that case we could do something
where the base fs might only have 10k uids mapped, but the overlay has
300k. But that gets a bit funky :)
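[For the record, the arithmetic behind Stéphane's sizing concern, reading "100000:0:100000" as host-first:container-first:count (an assumption about the notation above, not something the thread defines). A container id maps to a host id only if it falls inside the 100k window, which is why ids in the 200k+ range used by some network authentication systems would not fit:]

```python
def container_to_host(cuid, host_first=100000, cont_first=0, count=100000):
    """Translate a container uid to a host uid under one shift range.

    Returns None when the uid is outside the mapped window (it would
    then appear as the overflow/nobody id rather than a real host uid).
    """
    if not (cont_first <= cuid < cont_first + count):
        return None
    return host_first + (cuid - cont_first)

assert container_to_host(0) == 100000       # container root
assert container_to_host(1000) == 101000    # an ordinary container user
assert container_to_host(250000) is None    # 200k+ range: unmapped
```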