[CC'd the security list because I really don't know who the right people are to drag into this discussion]

While looking at some issues that have cropped up with making it so that RLIMIT_NPROC cannot be escaped by creating a user namespace, I have stumbled upon a very old issue of how rlimits and suid exec interact poorly.

This specific saga starts with commit 909cc4ae86f3 ("[PATCH] Fix two bugs with process limits (RLIMIT_NPROC)") from https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git which essentially replaced a capable() check with an open-coded implementation of suser() for RLIMIT_NPROC.

The description from Neil Brown was:

    1/ If a setuid process swaps it's real and effective uids and then
       forks, the fork fails if the new realuid has more processes than
       the original process was limited to.

       This is particularly a problem if a user with a process limit
       (e.g. 256) runs a setuid-root program which does setuid() +
       fork() (e.g. lprng) while root already has more than 256 process
       (which is quite possible).

       The root problem here is that a limit which should be a per-user
       limit is being implemented as a per-process limit with
       per-process (e.g. CAP_SYS_RESOURCE) controls. Being a per-user
       limit, it should be that the root-user can over-ride it, not
       just some process with CAP_SYS_RESOURCE.

       This patch adds a test to ignore process limits if the real user
       is root.

The test to see if the real user is root was:

	if (p->real_cred->user != INIT_USER) ...

which persists to this day in kernel/fork.c:copy_process().

The practical problem with this test is that it works like nothing else in the kernel, and so does not look like what it is. Saying:

	if (!uid_eq(p->real_cred->uid, GLOBAL_ROOT_UID)) ...

would at least be more recognizable.

Really this entire test should be:

	if (!capable(CAP_SYS_RESOURCE)) ...

because CAP_SYS_RESOURCE is the capability that controls whether you are allowed to exceed your rlimits.
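To make the difference between those two forms of the test concrete, here is a rough user-space model of Neil Brown's scenario. It is only a sketch: struct task_model, fork_allowed_real_root() and fork_allowed_capable() are made-up names for illustration, not kernel interfaces, and the cap_sys_resource flag merely stands in for the result of capable(CAP_SYS_RESOURCE).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model for illustration only; nothing here is a kernel API. */
#define MODEL_ROOT_UID 0u

struct task_model {
	unsigned int real_uid;      /* uid the process count is charged to */
	unsigned int effective_uid; /* uid used for ordinary permission checks */
	bool cap_sys_resource;      /* stand-in for capable(CAP_SYS_RESOURCE) */
	unsigned long nproc_limit;  /* this task's RLIMIT_NPROC */
};

/*
 * Variant 1: models the "real user is root" exemption.
 * real_user_procs is the number of processes already charged to real_uid.
 */
bool fork_allowed_real_root(const struct task_model *t,
			    unsigned long real_user_procs)
{
	if (real_user_procs < t->nproc_limit)
		return true;                     /* under the limit */
	return t->real_uid == MODEL_ROOT_UID;    /* over it: only root passes */
}

/* Variant 2: models exempting only tasks that hold the capability. */
bool fork_allowed_capable(const struct task_model *t,
			  unsigned long real_user_procs)
{
	if (real_user_procs < t->nproc_limit)
		return true;                     /* under the limit */
	return t->cap_sys_resource;              /* over it: capability decides */
}
```

With a task whose real uid is 0, effective uid is 1000, and limit is 256, while 300 processes are charged to root, variant 1 lets the fork through regardless of capabilities. Variant 2 lets it through only if the task still holds the capability, which is why the choice of which user the count is charged to matters as much as the form of the test.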
Which brings us to the practical issues of how all of these things are wired together today.

The per-user rlimits are accounted based upon a process's real user, not the effective user. All other permission checks are based upon the effective user. This has the practical effect that when uids are swapped as above, the processes are charged to root but use the permissions of an ordinary user.

The problems get worse when you realize that suid exec does not reset any of the rlimits except for RLIMIT_STACK.

The rlimits that are particularly affected and are per-user are: RLIMIT_NPROC, RLIMIT_MSGQUEUE, RLIMIT_SIGPENDING, RLIMIT_MEMLOCK.

But I think failing to reset rlimits during exec has the potential to affect any suid exec.

Does anyone have any historical knowledge or sense of how this should work?

Right now it feels like we have coded ourselves into a corner and will have to risk breaking userspace to get out of it. In other words, I think we need a policy of resetting rlimits on suid exec, and I think we need to account the per-user rlimits against the effective user, not the real user. Those changes should allow making capable() calls where they belong, and removing the much too magic user == INIT_USER test for RLIMIT_NPROC.

Eric