On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote: > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote: > > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote: > > > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote: > [...] > > > Hmm anyway you are mounting this on behalf of filesystems, so if > > > you > > > add the recursive thing, you will just probably make everything > > > worse, by making any /proc, /sys dentry that's under that path > > > shiftable, and unprivileged users can just create user namespaces > > > and > > > read /proc/* and all the other stuff that doesn't have capable() > > > related to the init_user_ns host... > > > > That's up to the admin who does the shifting. Recursive would be > > an > > option if added. > > Hmm, not sure if you get my point... you just made it an admin > problem where admins want to mount an image downloaded verify it and > use it for their container with /proc...! that's another problem! You can't allow unprivileged containers to shift uids on arbitrary filesystems, so the admin always has to do something for the initial setup. > > > what if you have paths like /filesystem0/uidshiftedY/dir, > > > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir > > > where some of them are also bind mounts that point to same dentry > > > ? > > > > Without recursive semantics, you see the underlying inode. With > > them, you see the upper vfsmnts. Shiftfs isn't idempotent, so you > > would need to be careful about nesting. However, that's an admin > > problem. > > > > > Also, you create a totally new user namespace interface here! by > > > making your own new interface we just lose the notion of > > > init_user_ns and its children and mapping ? > > > > I don't quite understand this; the only use of the init_user_ns is > > the capable(CAP_SYS_ADMIN) in fill_super which is how only the real > > admin can mount at a shifted uid/gid. Otherwise, there's no need > > to see into the userns because filesystems see the kuid_t/kgid_t > > which is what I'm shifting. > > > > > I'm not sure of the implication of all this... your user > > > namespace mapping is not related at all to init_user_ns! it seems > > > that it has its own init_user_ns ? does a capable() check now > > > on a shifted filesystem relates to that and hence to your mapping > > > or to the real init_user_ns ? > > > > capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN) > > > > Or is there a misunderstanding here about how user namespaces work > > inside the kernel? The design is that the ID shift is done as you > > cross the kernel boundary, so a filesystem, being usually all in > > -kernel operating via the VFS interfaces, ideally never needs to > > make any from_kuid/make_kuid calls. However, there are ways > > filesystems can send data across the kernel/user bounary outside of > > the usual vfs interfaces (ioctls being the most usual one) so in > > that specific code, they have to do the kuid_t to uid_t changes > > themselves. Shiftfs never sends data to the user outside of the > > VFS so it never needs to do this and can operate entirely on > > kuid_ts. > > > > > > There's a bit of an open question of whether it should have vfs > > > > changes: the way the struct file f_inode and f_ops are hijacked > > > > is a bit nasty and perhaps d_select_inode() could be made a bit > > > > cleverer to help us here instead. > > > > > > I'm not sure if this PoC works... but you sure you didn't > > > introduce a serious vulnerability here ? you use a new mapping > > > and you update current_fsuid() creds up, which is global on any > > > fs operation, so may be: lets operate on any inode, update our > > > current_fsuid()... and access the rest of *unshifted filesystems* > > > ... !? > > > > The credentials are per thread, so it's a standard way of doing > > credential shifting and no other threads of execution in the same > > task get access. As long as you bound the override_creds()/revert_c > > reds() pairs within the kernel, you're safe. > > No, and here sorry I mean shifted. > > current_fsuid() is global through all fs operations which means it > crosses user namespaces... it was safe the days of only init_user_ns, > not anymore... You give a mapping inside containers to fsuid where > they don't want to have it... this allows to operate on inodes inside > other containers... update current_fsuid() even if we want that user > to be nobody inside the container... and later it can access the > inodes of the shifted fs... and by same current of course... OK, I still don't understand what you're getting at. There are three per-thread uids: uid, euid and fsuid (real, effective and filesystem). They're all either settable via syscall or inherited on fork. They're all kernel side, meaning they're kuid_t. Their values stay invariant as you move through namespaces. They change (and get mapped according to the current user namespace setting) when you call set[fe]uid() So when I enter a user namespace with mapping 0 100000 1000 and call setuid(0) (which sets all three). they all pick up the kuid_t of 100000. This means that writing a file inside the user namespace after calling setuid(0) appears as real uid 100000 on the medium even though if I call getuid() from the namespace, I get back 0. What shiftfs does is hijack temporarily the kernel fsuid/fsgid for permission checks, so you can remap to any old uid on the medium (although usually you'd pass in uidmap=0:100000:1000") it maps back from kuid_t 100000 to kuid_t 0, which is why the container can now read and write the underlying medium at on-media id 0 even through root inside the container has kuid_t 100000. There's no permanent change of fsuid and it stays at its invariant value for the thread except as a temporary measure to do the permission checks on the underlying of the shifted filesystem. > > > The worst thing is that current_fsuid() does not follow now the > > > /proc/self/uid_map interface! this is a serious vulnerability and > > > a mix of the current semantics... it's updated but using other > > > rules...? > > > > current_fsuid() is aready mapped via the userns; it's already a > > kuid_t at its final value. Shifting that is what you want to remap > > underlying volume uid/gid's. The uidmap/gidmap inputs to this are > > shifts on the final underlying uid/gids. > > => some points: > Changing setfsuid() its interfaces and rules... or an indrect way to > break another syscall... There is no change to setfsuid(). > The userns used for *mapping* is totatly different and not standard.. > . losing "init_user_ns and its decendents userns *semantics*...", a > yet a totatly unlinked mapping... There is no user namespace mapping at all. This is a simple shift, kernel side, of uids and gids at their kuid_t values. > Breaking current_uid(),current_euid(),current_fsuid() which are > mapped but in *different* user namespaces... hence different values > inside namespaces... you can change your userns mapping but that > current_fsuid specific one will always be remapped to some other > value inside even if you don't want it... It crosses user > namespaces... uid and euid are remapped according to /proc/self/uid_ > map, fsuid is remapped according to this new interface... > > Hard coding the mapping, nested containers/apps may *share* fsuid and > can't get rid of it even if they change the inside userns mapping to > disable, split, reduce mapped users or offer better isolation they > can't... no way to make private inodes inside containers if they > share the final fsuid, inside container mapping is ignored... > ... OK, I think there's a misunderstanding about how credential overrides work. They're not permanent changes to the credentials, they're temporary ones to get stuff done within the kernel at a temporary privilege. You can make credentials permanent if you go through prepare_creds()/commit_creds(), but for making them temporary you do prepare_creds()/override_creds() and then revert_creds() once you're done using them. If you want to see a current use of this, try fs/open.c:faccessat. What it's doing is temporarily overriding fsuid with the real uid to check the permissions before reverting the credentials and returning to the user. James > > the privileged ids down to 100000, but I have a volume which still > > has realids, I can mount that volume using shiftfs with > > uidmap=0:100000:1000 and it will allow this userns to read and > > write the volume through its remapped ids. > > > > > For overlayfs I did write an expriment but for me it's not an > > > overlayfs or another new filesystem problem... we are > > > manipulating UID/GID identities... > > > > > > It would have been better if you did send this as a separate > > > thread. It was a vfs:userns RFC fix which if we continue we turn > > > it into a complicated thing! implement another new light > > > filesystem with userns... (overlayfs...) > > > > > > Will follow up if the appropriate thread is created, not here, I > > > guess it's ok ? > > > > Well, I can resend the patch as a separate thread when I've fixed > > some of the problems viro pointed out. > > > > James > > > > > > James > > > > > > > > > > Thank you for your feedback! > > > > > > > > > > Thanks! > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html