On Tue, Jul 03, 2018 at 11:54:50AM -0500, Serge E. Hallyn wrote: > Quoting James Bottomley (James.Bottomley@xxxxxxxxxxxxxxxxxxxxx): > > On Mon, 2018-06-18 at 20:11 +0300, Amir Goldstein wrote: > > > On Mon, Jun 18, 2018 at 5:56 PM, James Bottomley > > > <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote: > > > [...] > > > > > > > - Does not break inotify > > > > > > > > > > > > I don't expect it does, but I haven't checked. > > > > > > > > > > I haven't checked either; I'm planning to do so soon. This is a > > > > > concern that was expressed to me by others, I think because > > > > > inotify doesn't work with overlayfs. > > > > > > > > I think shiftfs does work simply because it doesn't really do > > > > overlays, so lots of stuff that doesn't work with overlays does > > > > work with it. > > > > > > > > > > I'm afraid shiftfs suffers from the same problems that the old naiive > > > overlayfs inode implementation suffered from. > > > > > > This problem is demonstrated with LTP tests inotify08 inotify09. > > > shiftfs_new_inode() is called on every lookup, so inotify watch > > > may be set on an inode object, then dentry is evicted from cache > > > and then all events on new dentry are not reported on the watched > > > inode. You will need to implement hashed inodes to solve it. > > > Can be done as overlay does - hashing by real inode pointer. > > > > > > This is just one of those subtle things about stacked fs and there > > > may be other in present and more in future - if we don't have a > > > shared code base for the two stacked fs, I wager you are going to end > > > up "cherry picking" fixes often. > > > > > > IMO, an important question to ask is, since both shiftfs and > > > overlayfs are strongly coupled with container use cases, are there > > > users that are interested in both layering AND shifting? on the same > > > "mark"? If the answer is yes, then this may be an argument in favor > > > of integrating at least some of shittfs functionality into overlayfs. > > > > My container use case is interested in shifting but not layering. Even > > the docker use case would only mix the two with the overlay graph > > driver. There seem to be quite a few clouds using non overlayfs graph > > drivers (the dm one being the most popular). > > > > > Another argument is that shiftfs itself takes the maximum allowed > > > 2 levels of s_stack_depth for it's 2 mounts, so it is actually not > > > possible with current VFS limitation to combine shiftfs with > > > overlayfs. > > > > That's an artificial, not an inherent, restriction that was introduced > > to keep the call stack small. It can be increased or even eliminated > > (although then we'd risk a real run off the end of the kernel stack > > problem). > > > > > This could be solved relatively easily by adding "-o mark" support > > > to overlayfs and allowing to mount shiftfs also over "marked" > > > overlayfs inside container. > > > > Can we please decided whether the temporary mark, as implemented in the > > current patch set or a more permanent security.<something> xattr type > > mark is preferred for this? It's an important question that's been > > asked, but we have no resolution on. > > I think something permanent is mandatory. Otherwise users may be able > to induce a reboot into a state where the temp mark isn't made. A > security.<something> xattr has the problem that an older kernel may not > know about it. > > Two possibilities which have been mentioned before: > > 1. just demand that the *source* idmap doesn't start at 0. Ideally it > would be something like 100k uids starting at 100k. The kernel would > refuse to do a shiftfs mount if the source idmap includes uid 0. > > I suppose the "let-them-shoot-themselves-in-the-foot" approach would be > to just strongly recommend using such a source uid mapping, but not > enforce it. > > 2. Enforce that the base directory have perms 700 for shiftfs-mount to > be allowed. So /var/lib/shiftfs/base-rootfs might be root-owned with > 700 perms. Root then can shiftfs-mount it to /container1 uid-shifted to > 100000:0:100000. Yes, root could stupidly change the perms later, but > failing that uid 1000 can never exploit a setuid-root script under > /var/lib/shiftfs/base-rootfs > > -serge I'm personally in favor of this second approach and is what I was hoping to use for LXD in the future. It's trivial for us to apply that to the parent directory of the container (so that the rootfs itself can have a different mode) and is something we're already doing today for privileged containers anyway (as those have the exact same issue with setuid binaries). Having the data on the host shifted to a map which is never used by any user on the system would make mistakes less likely to happen but it would also require you to know ahead of time how many uid/gid you'll need for all future containers that will use that filesystem. 100k for example would effectively prevent the use of many network authentication systems inside the container as those tend to map to the 200k+ uid range. -- Stéphane Graber Ubuntu developer http://www.ubuntu.com
Attachment:
signature.asc
Description: PGP signature
_______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers