On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote: > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote: > > On Thu, 2016-05-05 at 18:08 -0400, James Bottomley wrote: > [...] > > > > > > OK, so the way attributes are populated on an inode is via > > > getattr. You intercept that, you change the inode owner and > > > group that are installed on the inode. That means that when you > > > list the directory, you see the shift and the shifted uid/gid are > > > used to check permissions for vfs_open(). > > > > Just to illustrate how this could be done, here's a functional > > proof of concept for a uid/gid shifting bind mount equivalent. > > It's not actually a proper bind mount because it has to > > manufacture its own inodes. As you can see, it can only be used by > > root, it will shift all the uid/gid bits as well as the permission > > comparisons. It operates on subtrees, so it can shift the > > uids/gids on any filesystem or part of one and because the shifts > > are per superblock, it could actually shift the same subtree for > > multiple users on different shifts. Best of all, it requires no > > vfs changes at all, being entirely implemented inside its own > > filesystem type. > > First, I guess this should be in a separate thread.. this way this > RFC was just hijacked! > > Obviously as you say later in your response it may require a VFS > change... I thought it may but viro didn't rip my head off for shifting the file operations and inode, so perhaps it's OK as is. > You have just consumed all inodes... what about containers or small > apps that are spawned quickly... it can even used maybe as a DoS... > maybe you endup reporting different inode numbers... ? Please explain? Shiftfs deliberately doesn't populate its dentry cache, so it basically has the same number inodes and dentries in use as the lower filesystem would ordinarily have. > > > You use it just like bind mount: > > > > mount -t shiftfs <source> <target> > > > > except that it takes uidshift=x:y:z and gidshift=x:y:z multiple > > times > > as options. It's currently not recursive and it definitely needs > > polishing to show things like mount options and be properly Kconfig > > using. > > why it's not recursive ? and what if you have circular bind mounts ? Because, as I said, it's a proof of concept. It can easily have MS_REC semantics added. > Hmm anyway you are mounting this on behalf of filesystems, so if you > add the recursive thing, you will just probably make everything > worse, by making any /proc, /sys dentry that's under that path > shiftable, and unprivileged users can just create user namespaces and > read /proc/* and all the other stuff that doesn't have capable() > related to the init_user_ns host... That's up to the admin who does the shifting. Recursive would be an option if added. > what if you have paths like /filesystem0/uidshiftedY/dir, > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir > where some of them are also bind mounts that point to same dentry ? Without recursive semantics, you see the underlying inode. With them, you see the upper vfsmnts. Shiftfs isn't idempotent, so you would need to be careful about nesting. However, that's an admin problem. > Also, you create a totally new user namespace interface here! by > making your own new interface we just lose the notion of init_user_ns > and its children and mapping ? I don't quite understand this; the only use of the init_user_ns is the capable(CAP_SYS_ADMIN) in fill_super which is how only the real admin can mount at a shifted uid/gid. Otherwise, there's no need to see into the userns because filesystems see the kuid_t/kgid_t which is what I'm shifting. > I'm not sure of the implication of all this... your user namespace > mapping is not related at all to init_user_ns! it seems that it has > its own init_user_ns ? does a capable() check now on a shifted > filesystem relates to that and hence to your mapping or to the real > init_user_ns ? capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN) Or is there a misunderstanding here about how user namespaces work inside the kernel? The design is that the ID shift is done as you cross the kernel boundary, so a filesystem, being usually all in-kernel operating via the VFS interfaces, ideally never needs to make any from_kuid/make_kuid calls. However, there are ways filesystems can send data across the kernel/user bounary outside of the usual vfs interfaces (ioctls being the most usual one) so in that specific code, they have to do the kuid_t to uid_t changes themselves. Shiftfs never sends data to the user outside of the VFS so it never needs to do this and can operate entirely on kuid_ts. > > There's a bit of an open question of whether it should have vfs > > changes: the way the struct file f_inode and f_ops are hijacked is > > a bit nasty and perhaps d_select_inode() could be made a bit > > cleverer to help us here instead. > > I'm not sure if this PoC works... but you sure you didn't introduce > a serious vulnerability here ? you use a new mapping and you update > current_fsuid() creds up, which is global on any fs operation, so may > be: lets operate on any inode, update our current_fsuid()... and > access the rest of *unshifted filesystems*... !? The credentials are per thread, so it's a standard way of doing credential shifting and no other threads of execution in the same task get access. As long as you bound the override_creds()/revert_creds() pairs within the kernel, you're safe. > The worst thing is that current_fsuid() does not follow now the > /proc/self/uid_map interface! this is a serious vulnerability and a > mix of the current semantics... it's updated but using other > rules...? current_fsuid() is aready mapped via the userns; it's already a kuid_t at its final value. Shifting that is what you want to remap underlying volume uid/gid's. The uidmap/gidmap inputs to this are shifts on the final underlying uid/gids. So, if I've got a uid_map in a userns of 0:100000:1000 which remaps all the privileged ids down to 100000, but I have a volume which still has realids, I can mount that volume using shiftfs with uidmap=0:100000:1000 and it will allow this userns to read and write the volume through its remapped ids. > For overlayfs I did write an expriment but for me it's not an > overlayfs or another new filesystem problem... we are manipulating > UID/GID identities... > > It would have been better if you did send this as a separate thread. > It was a vfs:userns RFC fix which if we continue we turn it into a > complicated thing! implement another new light filesystem with > userns... (overlayfs...) > > Will follow up if the appropriate thread is created, not here, I > guess it's ok ? Well, I can resend the patch as a separate thread when I've fixed some of the problems viro pointed out. James > > James > > > > Thank you for your feedback! > > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html