On Fri, Jun 15, 2018 at 02:09:52PM -0700, James Bottomley wrote: > On Fri, 2018-06-15 at 15:47 -0500, Seth Forshee wrote: > > On Fri, Jun 15, 2018 at 10:22:09AM -0700, James Bottomley wrote: > > > On Sat, 2018-06-16 at 03:04 +1000, Aleksa Sarai wrote: > > > > On 2018-06-15, James Bottomley <James.Bottomley@HansenPartnership > > > > .com > > > > > wrote: > > > > > > > - Supports any id maps possible for a user namespace > > > > > > > > > > > > Have we already ruled out storing the container's > > > > > > UID/GID/perms in an extended attribute, and having all the > > > > > > files owned by the owner of the container from the > > > > > > perspective of the unshifted fs. Then shiftfs reads the > > > > > > xattr and presents the files with the container's idea of > > > > > > what the UID is? > > > > > > > > > > I've got an experimental patch set that does the *mark* as an > > > > > xattr. > > > > > > > > I forgot to ask you about this when we all met face-to-face -- > > > > can you go over what the purpose of marking the mounts before > > > > being able to shifts is? When I saw your demo at LPC I was quite > > > > confused about what it was doing (I think you mentioned it was a > > > > security feature, but I must admit I didn't follow the > > > > explanation). > > > > > > OK, so the basic security problem is that an unprivileged tenant > > > cannot be allowed arbitrary access to both the shifted and > > > underlying unshifted locations because they can do writes to the > > > shifted mount that appear at real uid/gid 0 in the underlying > > > unshifted location, setting up all sorts of unpleasant threats of > > > which suid execution is just the most obvious one. > > > > > > My mount marking solution, which the v2 (and forthcoming v3) has is > > > the idea that the admin buries the real underlying location deep in > > > a path inaccessible (to the tenant) part of the filesystem and then > > > exposes a marked mount point to the tenant by doing > > > > > > mount -t shiftfs -o mark <underlying location> <tenant visible> > > > > > > Then in the <tenant visible> location we can block the potential > > > exploits. When the tenant is building an unprivileged container, > > > it can do > > > > > > mount -t shiftfs <tenant visible> <container location> > > > > > > And the <container location> will now have the shifting in place. > > > > More generally, we can't allow an unprivileged user ns to mount any > > subtree with an id shift unless the context that controls that > > subtree (i.e. CAP_SYS_ADMIN in sb->s_user_ns) allows it. Otherwise it > > would be a simple matter for any user to create a user ns and make an > > id shifted mount of /. The marking in shiftfs is one way of solving > > this problem. > > > > I don't know if you saw my comments about marking earlier in the > > thread. Tl;dr, I think that the new mount apis in the filesystem > > context patches could allow an alternative to marking. > > Yes, I read it an I have a draft reply except that I need to > investigate the new mount API. My suspicion is that it won't work over > standard mount(8) so now we'll need a tool to set it up. > > The other point is that it's a temporary mark (again) so it has to be > renewed over every reboot. The impression I picked up was this was the > least liked part of the current marking scheme (well, possibly first > equally disliked with the complexity) so the xattr (or some other) > scheme might be better received. >From my perspective the primary benefit of a permanent mark is that the kernel could implement some protections for marked subtrees, and presumably we could do away with the intermediate mount. That sounds much nicer than the current marking scheme, imo. I do understand that for others less admin-side setup work is also a benefit. > > I think we should be able to arrange it so that the "host" context > > sets up a mount fd for shiftfs mounting a sepecific subtree then > > passes that fd into the container. The container can then use the fd > > to attach the mount to its filesystem tree. This will provide all the > > benefits of marking without that awkward intermediate mount point. > > Doesn't this mean that whatever sits on the host end has to be > privileged enough to open the fd in the first place, meaning either a > suid/CAP_SYS_<something> binary or root itself? Yes to the "something privieleged enough" question. It should not be a suid/setcap executable; an unprivileged user must not be able to pick what paths are eligible for id shifting. That has to be an admin operation. > > Of course those patches haven't been merged yet, but based on the > > discussion I've seen their prospects look good. > > > > > This scheme is ephemeral (the marked mount has to be recreated on > > > every boot) and rather complex, so the alternative is to add a > > > permanent mark to <underlying location> so that regular tenant > > > access can be secured (or even prohibited) but the tenant can still > > > do > > > > > > mount -t shiftfs <underlying location> <container location> > > > > This of course would not be possible in my proposed mount fd scheme. > > > > > To get the shifting properties in the container. In this version > > > of the scheme, the shift mountable directory is marked with a > > > security xattr that is permanent (survives reboot) but requires > > > that the filesystem support xattrs, of course. > > > > > > The down side of the xattr scheme is that the securing against the > > > tenant part becomes an xattr enforced thing rather than a shiftfs > > > enforced thing, so it has to be an additional patch to the kernel > > > itself rather than being inside a self contained module. > > > > Would this work for nested containers? I guess it should be fine to > > allow setting that xattr for CAP_SYS_ADMIN in sb->s_user_ns, so > > probably so. > > Yes, the s_user_ns is simply a copy of the user_ns in operation when > the mount was created, so if user_ns is nested then s_user_ns will be a > reverse nesting. This should work independently of the marking scheme. > > I'm just in the middle of reposting. The cc list is getting unwieldy; > is it OK if I the fsdevel, kernel, security and containers lists and > drop everyone except for a couple of containers people who aren't on > kernel lists? You don't have to include me on the Cc. Thanks, Seth