On Sat, Jan 04, 2020 at 12:39:45PM -0800, James Bottomley wrote:
> This implementation reverse shifts according to the user_ns belonging
> to the mnt_ns. So if the vfsmount has the newly introduced flag
> MNT_SHIFT and the current user_ns is the same as the mount_ns->user_ns
> then we shift back using the user_ns before committing to the
> underlying filesystem.
>
> For example, if a user_ns is created where interior (fake root, uid 0)
> is mapped to kernel uid 100000 then writes from interior root normally
> go to the filesystem at the kernel uid. However, if MNT_SHIFT is set,
> they will be shifted back to write at uid 0, meaning we can bind mount
> real image filesystems to user_ns protected faker root.

Thanks, James, I definitely would like to see shifting in the VFS API.

I have a few practical concerns about this implementation, but my
biggest concern is more fundamental: by design, this again leaves
uid-0 owned files, written by an untrusted user, littered about the
filesystem.

I would feel much better if you institutionalized having the origin
shifted. For instance, take a squashfs for a container fs, shift it so
that fsuid 0 == hostuid 100000. Mount that, with a marker saying how it
is shifted, then set 'shiftable'. Now use that as a base for allowing
an unpriv user to shift. If that user has subuid 200000 as container
uid 0, then its root will write files as uid 100000 in the fs.

This isn't perfect, but I think something along these lines would be
far safer. Two namespaces with different uid maps can share the
filesystem as though they both had the same uidmap. (To me this is
currently the most interesting use case for shifting bind mounts.)

If the user wants to tar up the result, they can do so in their own
namespace, resulting in uid 0 shown as uid 0. If host root wants to do
so, they can umount it everywhere and use something like fuidshift.
Or, they can also create a namespace to do the shifting to uid 0 in
tar.
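The arithmetic in the squashfs example above can be sketched as follows.
This is a userspace illustration in Python, not kernel code and not the
patch's implementation. The names Extent, to_outside, to_inside and
on_disk_uid are hypothetical, the range length of 65536 is an assumption,
and the second namespace at 300000 is invented purely to show two
different uid maps sharing one image. The identity "image map" models the
behaviour described in the quoted patch text, where writes land at uid 0.

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    """One uid_map-style range (hypothetical helper type)."""
    inside: int   # first uid as seen inside the namespace / image
    outside: int  # first uid on the host (kernel) side
    count: int    # number of uids in the range


def to_outside(extents: List[Extent], uid: int) -> Optional[int]:
    """Map an inside uid to its host-side uid, or None if unmapped."""
    for e in extents:
        if e.inside <= uid < e.inside + e.count:
            return e.outside + (uid - e.inside)
    return None


def to_inside(extents: List[Extent], uid: int) -> Optional[int]:
    """Map a host-side uid back to the inside uid, or None if unmapped."""
    for e in extents:
        if e.outside <= uid < e.outside + e.count:
            return e.inside + (uid - e.outside)
    return None


def on_disk_uid(kuid: int, userns_map: List[Extent],
                image_map: List[Extent]) -> Optional[int]:
    """Shift a writer's kernel uid into the uid stored in the filesystem.

    Step 1: undo the writer's namespace mapping (kernel uid -> container uid).
    Step 2: apply the marker recorded on the mount (container uid -> fs uid).
    """
    inner = to_inside(userns_map, kuid)
    if inner is None:
        return None  # writer is not mapped in this namespace: no shift
    return to_outside(image_map, inner)


# Assumed values: range length 65536; USERNS_B is invented for illustration.
IMAGE_SHIFTED = [Extent(0, 100000, 65536)]   # image where fsuid 0 == hostuid 100000
IMAGE_UNSHIFTED = [Extent(0, 0, 65536)]      # quoted patch: write back at uid 0
USERNS_A = [Extent(0, 200000, 65536)]        # user with subuid 200000 as uid 0
USERNS_B = [Extent(0, 300000, 65536)]        # a second, different uid map
USERNS_JAMES = [Extent(0, 100000, 65536)]    # the quoted example's namespace

if __name__ == "__main__":
    # Container root in either namespace lands on fs uid 100000.
    print(on_disk_uid(200000, USERNS_A, IMAGE_SHIFTED))
    print(on_disk_uid(300000, USERNS_B, IMAGE_SHIFTED))
    # With an unshifted origin, container root lands on fs uid 0.
    print(on_disk_uid(100000, USERNS_JAMES, IMAGE_UNSHIFTED))
```

With the shifted origin, no file written through the mount ends up owned
by uid 0 on disk, while both namespaces still see their own root as the
owner.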
My more practical concerns include:

(1) Once a userns has set a shiftable bind mount to shift, if it then
    creates a new child userns, that ns will not (iiuc) see the fs as
    shifted.

(2) There seems to be no good reason to stick to caching the cred for
    only one mnt, versus having a per-userns hashtable of creds for
    shifted mnts. Was that just a temporary shortcut, or did you
    intend it to stay that way?