On Mon, 2016-03-07 at 21:15 -0800, Andy Lutomirski wrote:
> Hi all-
>
> I think there are three main types of concerns.  First, there might be
> some as-yet-unknown semantic issues that would allow privilege
> escalation by users who create user namespaces and then confuse
> something else in the system.  Second, enabling user namespaces
> exposes a lot of attack surface to unprivileged users.  Third,
> allowing tasks to create user namespaces exposes the kernel to various
> resource exhaustion attacks that wouldn't be possible otherwise.

In my work on xdg-app I've seen some issues that I'd ideally like to
see solved. They are not necessarily security vulnerabilities, but
they are still problems:

devpts is only mountable in a user namespace if the root user is
mapped. This is possible to work around, but ugly.

There is no way to recursively apply mount flags. For example, I often
want to recursively bind mount some directory from the host, but with
MS_RDONLY|MS_NODEV. I cannot apply the flags in the MS_BIND|MS_REC
mount, so instead I have to first bind mount and then remount. However,
the remount is not recursive, so I have to manually parse
/proc/self/mountinfo and figure out all the submounts that were added.
I also have to manually avoid trying to remount covered mounts, because
I can't reach those, and for each remount I have to parse out its
current flags so I don't accidentally unset a flag that is already set,
which causes EPERM.

Mount flags are not applied to propagated mounts. Even if I do all of
the above, if a *new* mount is propagated into my namespace, or if an
unmount of a parent is propagated, uncovering a mount in my namespace,
then this new mount point is not read-only. This has no workaround that
I'm currently aware of.

Abstract unix domain sockets are tied to the network namespace. I
understand where this comes from: socket syscalls are "networkish".
However, the non-abstract unix domain sockets are under the control of
the filesystem namespace, and I can fully control them when setting up
the sandbox. But as long as the sandbox shares the network namespace
with the host (which is likely for desktop apps), it has full access
to all services listening on abstract sockets on the host. This is
particularly problematic because 1) abstract sockets have no file
permissions, so any X server running on the host is wide open, and
2) whether a connect call uses an abstract socket is not detectable
via seccomp, so we can't filter it that way either. I don't know how
severe this is, as it depends on how trustworthy the individual
services are, but at least on my system "grep @ /proc/net/unix" lists
session dbus instances, the X server, and some iSCSI thing.

/proc (even the limited pid namespace one) contains a lot of old cruft
that at a minimum leaks hardware info to the sandbox, and could
potentially do worse (/proc/sysrq-trigger anyone?). I'd like to be able
to mount a "clean" /proc that has only the process-related stuff.

> +++ What does the privilege of creating a user namespace entail? +++
>
>
> It might be more interesting to allow a task to unshare all
> namespaces, hold all capabilities in them, but to still be unable to
> use certain privileged facilities.  For example, maybe denying
> administrative control over iptables, creation of exotic network
> interface types, or similar would make sense.
> I don't know how we'd specify this type of constraint.

I think this particular issue is the main problem here. Unless we add
some very coarse bit-flags that specify the constraints, setting up
such constraints is going to need a very complex API. Adding coarse
bit-flags essentially means adding new capabilities (maybe subsetting
existing ones). Given how hard it is to understand how all the current
capabilities interact and how they can be exploited, I'm not sure this
is a great idea.
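
To make the abstract socket point above concrete, here is a minimal
sketch (not part of xdg-app or any existing tool) of the check that
"grep @ /proc/net/unix" performs. The function name abstract_sockets
and the sample rows are made up for illustration, and the column
layout is assumed to match what /proc/net/unix prints, where abstract
names are shown with a leading "@".

```python
def abstract_sockets(proc_net_unix_text):
    """Return the abstract socket names found in /proc/net/unix-style text."""
    names = []
    # The first row is the column header, so skip it.
    for line in proc_net_unix_text.splitlines()[1:]:
        # Assumed columns: Num RefCount Protocol Flags Type St Inode [Path]
        # Split at most 7 times so a path containing spaces stays whole.
        fields = line.split(None, 7)
        # Unbound sockets have no path column; abstract names start with "@".
        if len(fields) == 8 and fields[7].startswith("@"):
            names.append(fields[7])
    return names


# Hypothetical sample rows, not captured from a real system.
SAMPLE = """\
Num       RefCount Protocol Flags    Type St Inode Path
0000000000000000: 00000002 00000000 00010000 0001 01 20001 @/tmp/.X11-unix/X0
0000000000000000: 00000002 00000000 00010000 0001 01 20002 /run/user/1000/bus
0000000000000000: 00000003 00000000 00000000 0001 03 20003
"""

if __name__ == "__main__":
    for name in abstract_sockets(SAMPLE):
        print(name)
```

On a real system one would pass open("/proc/net/unix").read() instead
of SAMPLE. Anything this returns on the host is reachable from a
sandbox that shares the host's network namespace.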

Maybe we can use the LSM framework to model the constraints? For
instance, the user could be allowed to create user namespaces, but the
processes in them automatically get some SELinux context applied. Then
that SELinux context could be configured to limit access to certain
operations.

> +++ Who can create user namespaces (possibly with restrictions)? +++
>
> I can think of a few formulations.
>
> A simpler approach would be to add a per-namespace setting listing
> users and/or groups that can unshare their userns.  A userns starts
> out allowing everyone to unshare userns, and anyone with CAP_SYS_ADMIN
> can change the setting.

This sounds like a cgroup controller to me. It makes sense for my use
case (i.e. sandboxed desktop apps). You want to give all processes in
the user's login session access to user namespaces, but not
necessarily to e.g. a service, a background process, or a cron job
running as that user.

> A fancier approach would be to have an fd that represents the right to
> unshare your userns.  Some privilege broker could give out those fds
> to apps that need them and meet whatever criteria are set.  If you try
> to unshare your userns without the fd, it falls back to some simpler
> policy.

In practice though, how would the privilege broker know and apply the
criteria? It doesn't even have the information the kernel has (such as
race-free access to the peer cgroup).

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
 Alexander Larsson                                            Red Hat, Inc
       alexl@xxxxxxxxxx            alexander.larsson@xxxxxxxxx
He's an ungodly devious paramedic on his last day in the job. She's a
sharp-shooting cigar-chomping archaeologist married to the Mob. They
fight crime!

_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/containers