On Tue, 2020-02-18 at 09:18 +0200, Amir Goldstein wrote:
> On Mon, Feb 17, 2020 at 10:56 PM James Bottomley
> <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > The object of this series is to replace shiftfs with a proper
> > uid/gid shifting bind mount instead of the shiftfs hack of
> > introducing something that looks similar to an overlay filesystem
> > to do it.
> >
> > The VFS still has the problem that in order to tell what vfsmount a
> > dentry belongs to, struct path would have to be threaded everywhere
> > struct dentry currently is. However, this patch is structured only
> > to require a rethreading of notify_change. The rest of the
> > knowledge that a shift is in operation is carried in the task
> > structure by caching the unshifted credentials.
> >
> > Note that although it is currently dependent on the new configfd
> > interface for bind mounts, only patch 3/3 relies on this, and the
> > whole thing could be redone as a syscall or any other mechanism
> > (depending on how people eventually want to fix the problem with
> > the new fsconfig mechanism being unable to reconfigure bind
> > mounts).
> >
> > The changes from v2 are that I've added Amir's reviewed-by for the
> > notify_change rethreading and I've implemented Serge's request for
> > a base offset shift for the image. It turned out to be much harder
> > to implement a simple linear shift than simply to do it through a
> > different userns, so that's how I've done it. The userns you need
> > to set up for the offset-shifted image is one where the interior
> > uid would see the shifted image as fake root. I've introduced an
> > additional "ns" config parameter, which must be specified when
> > building the allow-shift mount point (so it's done by the admin,
> > not by the unprivileged user). I've also taken care that the image
> > shifted to zero (real root) is never visible in the filesystem.
> > Patch 3/3 explains how to use the additional "ns" parameter.
> >
>
> James,
>
> To us common people who do not breathe containers, your proposal
> seems like a competing implementation to Christian's proposal [1].

I think we have three things swirling around this space that aren't
quite direct copies of each other's use cases, but aren't entirely
disjoint either: the superblock user namespace, this, and the user
namespace fsid mapping.

> If it were a competing implementation, I think Christian's proposal
> would have won by points for being less intrusive to VFS.

Heh, that one's subjective. I think the current fsid code is missing
quite a few use cases in the stat/attr/in_group_p paths. I'm just
building the code now to run it through the shiftfs tests and see how
it fares. I think once those cases are added, the VFS changes in fsid
will be the same as I have in patch 2/3 ... primarily because we all
have to shift the same thing at the same point. If you include the
notify_change rethreading then, yes, you're correct, but that patch
does stand on its own and is consonant with a long-term VFS goal of
using path instead of dentry.

> But it is not really a competing implementation, is it? Your
> proposals meet two different, but very overlapping, sets of
> requirements. IMHO, none of you did a really good job of explaining
> that in the cover letter, let alone refer to each other's proposals
> (I am referring to your v3 posting, of course).

Yes, I know, but the fsid one is only a few days old, so I haven't had
time to absorb all of it yet.
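
Just to make the offset shift in the cover letter above a bit more
concrete, here's a rough userspace sketch of the mapping arithmetic
the extra "ns" userns provides. The 100000 base, the 65536 count and
the helper names are made up for illustration; the actual interface is
whatever patch 3/3 describes. The only point is that inside that ns
the image's on-disk owner appears as fake root.

/*
 * Rough userspace illustration of the offset-shift mapping; not kernel
 * code and not the patch 3/3 interface. Hypothetical numbers: the
 * image on disk is owned by uids 100000..165535, and the admin sets up
 * a userns whose uid_map line is "0 100000 65536".
 */
#include <stdio.h>

struct uid_extent {	/* one uid_map line: "inside outside count" */
	unsigned int inside;
	unsigned int outside;
	unsigned int count;
};

/* on-disk (outside) uid -> what the interior of the ns sees */
static int map_down(const struct uid_extent *e, unsigned int disk_uid,
		    unsigned int *ns_uid)
{
	if (disk_uid < e->outside || disk_uid - e->outside >= e->count)
		return -1;	/* unmapped: would show up as nobody */
	*ns_uid = disk_uid - e->outside + e->inside;
	return 0;
}

/* interior uid -> what ends up on disk (outside) */
static int map_up(const struct uid_extent *e, unsigned int ns_uid,
		  unsigned int *disk_uid)
{
	if (ns_uid < e->inside || ns_uid - e->inside >= e->count)
		return -1;
	*disk_uid = ns_uid - e->inside + e->outside;
	return 0;
}

int main(void)
{
	struct uid_extent e = { .inside = 0, .outside = 100000,
				.count = 65536 };
	unsigned int u;

	if (!map_down(&e, 100000, &u))
		printf("on-disk uid 100000 appears inside as %u (fake root)\n", u);
	if (!map_up(&e, 0, &u))
		printf("interior root writes files as on-disk uid %u\n", u);
	return 0;
}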
> IIUC, Christian's proposal deals with a single shared image per
> non-overlapping group of containers. And it deals with this use case
> very elegantly, IMO. From your comments on Christian's post, it does
> not seem that you object to his proposal, except that it does not
> meet the requirements for all of your use cases.

No, but I think it could. It's one of those perennial problems of more
generic vs. more specific to the use case. I'm a bit lost as to what
we really need for containers. In the original shiftfs I made it
superblock based precisely because that was the only way I could
integrate s_user_ns into it ... and I thought that was a good idea.

> IIUC, your proposal can deal with multiple shared images per
> overlapping groups of containers

That's right ... it does the shift based on path and user namespace.
In theory it allows an image with shifted and unshifted pieces ...
however, I'm not sure there's even a use case for that granularity,
because all my current image shifting use cases are all or nothing.
The granularity is an accident of the bind mount implementation.

> and it adds an element of "auto-reverse-mapping", which reduces the
> administration overhead of what would otherwise be an orchestration
> nightmare.

Right, but the same thing could be an option to the fsid proposal: the
default use could shift forward on kuid and back on the same map for
fsuid ... then it would do what shiftfs does. Likewise, shiftfs could
pick up the shift from an existing namespace and thus look more like
what fsuid does.

> It seems to me that you should look into working your patch set on
> top of fsid mapping and try to make use of it as much as possible.
> And to make things a bit more clear to the rest of us, you should
> probably market your feature as an "auto back-shifting mount" or
> something like that and explain the added value of the feature on top
> of plain fsid mapping.

Well, we both need the same shift points, so we could definitely both
work off a generic VFS encapsulation of "shift needed here". Once
that's done, it becomes a question of use and activation. I can't help
feeling that, now that we've been around the houses a few times,
s_user_ns is actually in the wrong place and should be in the mount
struct, not the superblock.

I get the impression we've got what we need to expose (the use cases)
well established, at least in our heads. The big question you're
asking is about implementation (the how), and also whether there's a
combination of the exposures that works for everyone.

I think this might make a very good session at LSF/MM. The "how" piece
is completely within the purview of kernel developers. The use-case
piece is more problematic, because it involves the container
orchestration community. However, if we could get agreement at LSF/MM
on a unified implementation that covers the three use cases, we'd be
in a much better position to have the container orchestration
conversation, because it then becomes simply a case of tweaking the
activation mechanisms.

I'll propose it as a topic.

James