On Wed, 2020-02-19 at 13:27 +0100, Christian Brauner wrote:
> On Tue, Feb 18, 2020 at 03:50:56PM -0800, James Bottomley wrote:
> > On Tue, 2020-02-18 at 15:33 +0100, Christian Brauner wrote:
[...]
> > > With fsid mappings we can solve this by writing an id mapping of
> > > 0 100000 100000 and an fsid mapping of 0 300000 100000. On
> > > filesystem access the kernel will now lookup the mapping for
> > > 300000 in the fsid mapping tables of the user namespace. And
> > > since such a mapping exists, the corresponding files will have
> > > correct ownership.
> >
> > So I did compile this up in order to run the shiftfs tests over it
> > to see how it coped with the various corner cases. However, what I
> > find is it simply fails the fsid reverse mapping in the
> > setup. Trying to use a simple uid of 0 100000 1000 and a fsid of
> > 100000 0 1000 fails the entry setuid(0) call because of this code:
>
> This is easy to fix. But what's the exact use-case?

Well, the use case I'm looking to solve is the same one it's always
been: getting a deprivileged fake root in a user_ns to be able to write
an image at fsuid 0. I don't think it's solvable in your current
framework, although allowing the domain to be disjoint might possibly
hack around it.

The problem with the proposed framework is that there are no backshifts
from the filesystem view, there are only forward shifts to the
filesystem view. This means that to get your framework to write a
filesystem at fsuid 0 you have to have an identity map for fsuid. Which
I can do: I tested uid shift 0 100000 1000 and fsuid shift 0 0 1000. It
does all work, as you'd expect because the container has real fs root
not a fake root.

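To make the mapping arithmetic above concrete, here is a toy model in
Python. It is only a sketch, not the kernel code: the Extent type and
the map_down/map_up helpers are hypothetical names for illustration, and
each triple is read the way a uid_map line is (first id inside the
namespace, first id outside, count).

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    """One mapping triple, read like a line of /proc/<pid>/uid_map."""
    first: int        # first id as seen inside the user namespace
    lower_first: int  # first id it corresponds to outside
    count: int        # number of consecutive ids covered


def map_down(extents: List[Extent], inside_id: int) -> Optional[int]:
    """Forward shift: inside id -> outside id, or None if unmapped."""
    for e in extents:
        if e.first <= inside_id < e.first + e.count:
            return e.lower_first + (inside_id - e.first)
    return None


def map_up(extents: List[Extent], outside_id: int) -> Optional[int]:
    """Reverse shift: outside id -> inside id, or None if unmapped."""
    for e in extents:
        if e.lower_first <= outside_id < e.lower_first + e.count:
            return e.first + (outside_id - e.lower_first)
    return None


# The mappings mentioned in this thread.
uid_map = [Extent(0, 100000, 1000)]           # uid shift 0 100000 1000
fsuid_identity = [Extent(0, 0, 1000)]         # fsuid shift 0 0 1000
fsuid_shifted = [Extent(0, 300000, 100000)]   # fsid mapping 0 300000 100000
fsuid_disjoint = [Extent(100000, 0, 1000)]    # fsid of 100000 0 1000

# Container root is uid 100000 outside, but with the identity fsuid map
# it writes files as real fsuid 0.
print(map_down(uid_map, 0), map_down(fsuid_identity, 0))

# With the shifted fsid map, container root writes at 300000 instead.
print(map_down(fsuid_shifted, 0))

# With the disjoint map, inside id 0 has no fsid entry at all.
print(map_down(fsuid_disjoint, 0))
```

In this model the identity fsuid map is the only one of the three that
puts container root at fsuid 0 on disk. The 100000 0 1000 case has no
forward entry for inside id 0, which is at least consistent with the
setuid(0) failure reported above; the actual kernel check is not
reproduced here.
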
And that's the whole problem. Firstly, I'm fs root for any filesystem
my userns can see, so any imprecision in setting up the mount namespace
of the container and I own your host. Secondly, any containment break
and I'm privileged with respect to the fs uid wherever I escape to, so I
will likewise own your host.

The only way to keep containment is to have a zero fsuid inside the
container corresponding to a non-zero one outside. And the only way to
solve the mount namespace imprecision issue is to strictly control the
entry point at which the writing at fsuid 0 becomes active.

James