Hi Dave, Tried to do my xfs homework first! On Fri, May 06, 2016 at 12:50:36PM +1000, Dave Chinner wrote: > On Thu, May 05, 2016 at 11:24:35PM +0100, Djalal Harouni wrote: > > On Thu, May 05, 2016 at 10:23:14AM +1000, Dave Chinner wrote: > > > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote: > > > > This is version 2 of the VFS:userns support portable root filesystems > > > > RFC. Changes since version 1: > > > > > > > > * Update documentation and remove some ambiguity about the feature. > > > > Based on Josh Triplett comments. > > > > * Use a new email address to send the RFC :-) > > > > > > > > > > > > This RFC tries to explore how to support filesystem operations inside > > > > user namespace using only VFS and a per mount namespace solution. This > > > > allows to take advantage of user namespace separations without > > > > introducing any change at the filesystems level. All this is handled > > > > with the virtual view of mount namespaces. > > > > > > [...] > > > > > > > As an example if the mapping 0:65535 inside mount namespace and outside > > > > is 1000000:1065536, then 0:65535 will be the range that we use to > > > > construct UIDs/GIDs mapping into init_user_ns and use it for on-disk > > > > data. They represent the persistent values that we want to write to the > > > > disk. Therefore, we don't keep track of any UID/GID shift that was applied > > > > before, it gives portability and allows to use the previous mapping > > > > which was freed for another root filesystem... > > > > > > So let me get this straight. Two /isolated/ containers, different > > > UID/GID mappings, sharing the same files and directories. Create a > > > new file in a writeable directory in container 1, namespace > > > information gets stripped from on-disk uid/gid representation. > > > > > > Container 2 then reads that shared directory, finds the file written > > > by container 1. As there is no no namespace component to the uid:gid > > > stored in the inode, we apply the current namespace shift to the VFS > > > inode uid/gid and so it maps to root in container 2 and we are > > > allowed to read it? > > > > Only if container 2 has the flag CLONE_MNTNS_SHIFT_UIDGID set in its own > > mount namespace which only root can set or if it was already set in > > parent, and have access to the shared dir which the container manager > > should also configure before... even with that if in container 2 the > > shift flag is not set then there is no mapping and things work as they > > are now, but yes this setup is flawed! they should not share rootfs, > > maybe in rare cases, some user data that's it. > > <head explods> > > I can't follow any of the logic you're explaining - you just > confused me even more. I thought this was to allow namespaces with > different uid/gid mappings all to use the same backing store? And > now you're saying that "no, they'll all have separate backing > stores"? Dave, absolutely for root file systems or probably most if not all use cases, they should have *separate* backing devices. For (1) obvious security reasons, (2) If they are writing to the filesystem, for quota, otherwise the whole thing is useless. > I suspect you need to describe the layering in a way a stupid dummy > can understand, because trying to be clever with wacky examples is > not working. OK, see below please. > > > Unless I've misunderstood something in this crazy mapping scheme, > > > isn't this just a vector for unintentional containment breaches? > > > > > > [...] > > > > > > > Simple demo overlayfs, and btrfs mounted with vfs_shift_uids and > > > > vfs_shift_gids. The overlayfs mounts will share the same upperdir. We > > > > create two user namesapces every one with its own mapping and where > > > > container-uid-2000000 will pull changes from container-uid-1000000 > > > > upperdir automatically. > > > > > > Ok, forget I asked - it's clearly intentional. This is beyond > > > crazy, IMO. > > > > This setup is flawed! that example was to show that files show up with > > the right mapping with two different user namespaces. As Andy noted > > they should have a backing device... > > Did you mean "should have different backing devices" here? If not, > I'm even more confused now... Yes, I mean a separate different bakcing device. Now some use cases may share *some* backing devices, but then it should not be the same backing store of the host / The container manager should mount a new backing device, maybe make a snapshot of host / on it and use it for containers. > > Anyway by the previous paragraph what I mean is that when the container > > terminates it releases the UID shift range which can be re-used later > > on another filesystem or on the same previous fs... whatever. Now if > > the range is already in use, userspace should grab a new range give it > > a new filesystem or a previous one which doesn't need to be shared and > > everything should continue to work... > > This sounds like you're talking about a set of single, sequential > uses of a single filesystem image across multiple different > container lifecycles? Maybe that's where I'm getting confused, > because I'm assuming multiple concurrent uses of a single filesystem > by all the running containers that are running the same distro > image.... Ok I see, yes a sequantial usage of the single filesystem according to available mappings and resources, and as noted above if the multiple concurent containers are using the same single filesystem, then that one should be mounted as read-only! where the writable data is on another backing store for every container, or a backing store for all containers... depends on the use cases, but at least they should not write to the host / backing store... Serge Hallyn already noted this.... and by default this is not possible and solutions in the kernel exists to make sure that they can't do it only if root allows that... but maybe I should just add a patch so you can't set shifts on the host / ,disable it there and let it for only new mounts.. > > simple example with loop devices..., however the image should be a GPT > > (GUID partition table) or an MBR one... > > > > $ dd if=/dev/zero of=/tmp/fedora-newtree.raw bs=10M count=100 > > $ mkfs.ext4 /tmp/fedora-newtree.raw > > ... > > $ sudo mount -t ext4 -oloop,rw,sync /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree > > $ sudo yum -y --releasever=23 --installroot=/mnt/fedora-tree --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim > > $ sudo mount -t ext4 -oloop,vfs_shift_uids,vfs_shift_gids, /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree > > $ sudo ~/container --uidmap [1000000:1065536 or > > 2000000:2065536 or > > 3000000:3065536 ....} > > (That's the mapping outside of the container) > > This doesn't match your above comments about separate backing > stores. Here we have two mounts sharing the same image file, both > mounted read/write - there's no separate backing store here. The > fact you hide the initial mount that was populated by yum by > overmounting the same mount point doesn't stop the original mount > from modifying the image file independently of the container you > started. > > I'm getting the impression that there's a missing step in all your > examples here - that you create a writable snapshot or overlay of > the original fs image to create separate backing devices for each > container. In that case, the uid/gid shifting avoids needing to make > uid/gid modifications to the snapshot/overlay to match the > container's mapped uid/gids. Yes exactly! we avoid writing to the snapshot or the image... the mapping is done automatically by the kernel, and the snapshot can be verified and used later by other containers with different mapping... And yes there are some missing steps in my explanation. Container manager should create new mount namespace, remount / MS_SLAVE|MS_RED, mount things there that can be shared as a tree, then create an other mount namespace, clean it, *create* private backing devices, mount loops, mount other private filesystems on the new backing devices, or mount images... then create a new mount/user namespaces for the container and spawn it. Now containers automatically have access to their private mounts and also to the shared read-only tree, the mapping is done virtually, no need to adapt the private mount or the shared tree, no need to chown, nothing. After the container finishes, the mapping 0:1000000:1065536 is released and can be used for another container, and another container can easily use the previous filesystem and shared tree with a new mapping 0:2000000:2065536 every thing works out of the box... Now if you want the host of the container manager to not even access container mounts, then when creating new mount namespace we remount / MS_SLAVE and even the host or container managed won't see again container's mounts... Another example that we want to support: (1) The vendor OS files are all under /usr directory, you have one snapshot that you use it for all containers. (2) Container manager prepares /usr snapshot verify if it's trusted... create mount namespace backing devices, etc. Mount it a shared read-only tree on a different backing device with shifted options for containers namespaces... (3) Container manager mounts /etc and other directories in private tmpfs => this allows to spawn containers, serve and forget about them, it may allow factory reset and all the other stuff.. > Similarly, if the use case given was read-only sharing of trees > between containers, there's no need for separate snapshots or > overlays, just a bunch of read-only (bind?) mounts with shifts > specified for the intended container. > > These seem like a pretty sane use case for wanting to shift > uids/gids in this manner, but if that's the case then I'm struggling Indeed! > > > > 3) ROADMAP: > > > > =========== > > > > * Confirm current design, and make sure that the mapping is done > > > > correctly. > > > > > > How are you going to ensure that all filesystems behave the same, > > > and it doesn't get broken by people who really don't care about this > > > sort of crazy? > > > > By trying to make this a VFS mount namespace parameter. So if the > > shift is not set on on the mount namespace then we just fallback to > > the current behaviour! no shift is performed. > > That wasn't what I was asking - I was asking a code maintenance > question. i.e. someone will come along who doesn't quite understand > WTF all this convoluted namespace ID mapping is doing and they will > accidently break it in a subtle way that nobody notices because they > didn't directly change anything to do with ID shifting. What's the > plan for preventing that from happening? Of course I'm not sure if it will be accepted, I guess this is the *safest* solution so far, which can be used to support all filesystems. So if it's applied I won't throuw the code at you and move, everyone is responsible for his own mess... but before that I'll add regression tests, security tests make sure that we don't break the current behaviour then the new one... IOW as it should be done with every merged code... Now I'm not sure if it will be accepted, I see that this thing is being discussed every year during summit... maybe it's a favorite topic ?! other patches about shifts break setfsuid, make current_fsuid() cross user namespaces.. add another userns interface.. so not sure, anyway I'll try to improve the RFC later or after the merge window, include XFS support, take all feedback, let the mount namepsace and the VFS do the work, and see! > > later of course I'll try xfstests and several tests... > > > > Does this answer your question ? > > That's closer, but ambiguous. ;) Were you planning on just running > some existing tests or writing a set of regression tests that > explicitly encode expected usage and behaviour, as well as what is > expected to fail? Of course yes! > > > ..... > > > > * Add XFS support. > > > > > > What is the problem here? > > > > Yep, sorry! just lack of time from my part! XFS currently is a bit aware > > of kuid/kgid mapping on its own, and I just didn't had the appropriate > > time! Will try to fix it next time. > > You'd be talking about the xfs_kuid_to_uid/xfs_uid_to_kuid() > wrappers, right? Yes! > It comes to the kuid/kgid being kernel internal representations of > an ID, not an on-disk format representation. Like all other kernel > internal types they can change size and structure at any time, while > the persistent on-disk format cannot change without lots of hassle > (and then we really need conversion functions!). For clean layering, > abstraction and self-documenting code, internal types are always > converted to/from a persistent, on-disk format representation in > this manner. I see, thank you for the explanation! So yes we have to update all these callers with: 1) xfs_ialloc() when creating a new inode: ip->i_d.di_uid = xfs_kuid_to_uid(current_fsuid()); should be: ip->i_d.di_uid = xfs_kuid_to_uid(vfs_shift_kuid_to_disk(inode, current_fsuid())); or a new VFS function like inode_init_owner() hmm inode_init_iuid() to give you the appropriate inode->i_uid on disk ip->i_d.di_uid = xfs_kuid_to_uid(inode_init_iuid()); This way inside container, you create an inode on disk with iuid == 0 and not using the global mapping of current_fsuid() 2) xfs_kuid_to_uid() for quota should also be updated, yes and to make it suitable for XFS it should be: vfs_shift_kuid_to_disk(super_block, kuid) or vfs_shift_kuid_to_fs(super_block, kuid); => xfs_kuid_to_uid(vfs_shift_kuid_to_fs(xfs_super_block, kuid))); These are the only changes for xfs to make it really work plus parsing vfs_shift_uids/gids mount options, this should work out of the box then... maybe later see with ACL... but from a VFS side, not fs. > > > Next question: how does this work with uid/gid based quotas? > > > > If you do a shift you should know that you will share quota on > > disk. > > Yes, and this means you can't account for individual container space > usage on such mapped devices. Also, don't you need to shift > uids/gids for the quota syscalls like you do elsewhere? Yes, ok! > I also wonder about the fact that the quota interfaces are likely to > return uids/gids that may not exist in a given container... Well if they do not exist or do not have a mapping we don't perform any shift, so I guess they will return the global kuid which happens to *always* have a mapping in init_user_ns and that's the one we use to operate from the filesystem or disk side. Now inside the container, for normal use cases we try at least to map 65536 the abvious range to let a complete system work... if some uids/gids inside the container are not mapped, they share 65534 the nobody user.. > Cheers, > > Dave. Thanks for the feedback! > -- > Dave Chinner > david@xxxxxxxxxxxxx -- Djalal Harouni http://opendz.org -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html