Michal Hocko <mhocko@xxxxxxxxxx> writes: > On Mon 17-10-16 11:39:49, Eric W. Biederman wrote: >> >> During exec dumpable is cleared if the file that is being executed is >> not readable by the user executing the file. A bug in >> ptrace_may_access allows reading the file if the executable happens to >> enter into a subordinate user namespace (aka clone(CLONE_NEWUSER), >> unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER). >> >> This problem is fixed with only necessary userspace breakage by adding >> a user namespace owner to mm_struct, captured at the time of exec, >> so it is clear in which user namespace CAP_SYS_PTRACE must be present >> in to be able to safely give read permission to the executable. >> >> The function ptrace_may_access is modified to verify that the ptracer >> has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns. >> This ensures that if the task changes it's cred into a subordinate >> user namespace it does not become ptraceable. > > I haven't studied your patch too deeply but one thing that immediately > raised a red flag was that mm might be shared between processes (aka > thread groups). What prevents those two to sit in different user > namespaces? > > I am primarily asking because this generated a lot of headache for the > memcg handling as those processes might sit in different cgroups while > there is only one correct memcg for them which can disagree with the > cgroup associated with one of the processes. That is a legitimate concern, but I do not see any of those kinds of issues here. Part of the memcg pain comes from the fact that control groups are process centric, and part of the pain comes from the fact that it is possible to change control groups. What I am doing is making the mm owned by a user namespace (at creation time), and I am not allowing changes to that ownership. The credentials of the tasks that use that mm may be in the same user namespace or descendent user namespaces. The core goal is to enforce the unreadability of an mm when an non-readable file is executed. This is a time of mm creation property. The enforcement of which fits very well with the security/permission checking role of the user namespace. Could this use of mm->user_ns be extended for some kind of accounting/limiting in the future? Possibly. I can imagine a limit on the total number of page table entries a group of processes are allowed to have as being a sane kind of limit in this setting much like RLIMIT_AS is sane on a single mm level. Pages don't belong to mm's so I can't imagine anything like the memcg being built on this kind of infrastructure. Eric _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers