How should rlimits, suid exec, and capabilities interact?

"Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> · Wed, 23 Feb 2022 12:00:16 -0600

[CC'd the security list because I really don't know who the right people
 are to drag into this discussion]

While looking at some issues that have cropped up with making it so
that RLIMIT_NPROC cannot be escaped by creating a user namespace I have
stumbled upon a very old issue of how rlimits and suid exec interact
poorly.

This specific saga starts with commit 909cc4ae86f3 ("[PATCH] Fix two
bugs with process limits (RLIMIT_NPROC)") from
https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git which
essentially replaced a capable() check with a an open-coded
implementation of suser(), for RLIMIT_NPROC.

The description from Neil Brown was:

   1/ If a setuid process swaps it's real and effective uids and then forks,
     the fork fails if the new realuid has more processes
     than the original process was limited to.
     This is particularly a problem if a user with a process limit
     (e.g. 256) runs a setuid-root program which does setuid() + fork()
     (e.g. lprng) while root already has more than 256 process (which
     is quite possible).

     The root problem here is that a limit which should be a per-user
     limit is being implemented as a per-process limit with
     per-process (e.g. CAP_SYS_RESOURCE) controls.
     Being a per-user limit, it should be that the root-user can over-ride
     it, not just some process with CAP_SYS_RESOURCE.

     This patch adds a test to ignore process limits if the real user is root.

The test to see if the real user is root was:
	if (p->real_cred->user != INIT_USER) ...
which persists to this day in fs/fork.c:copy_process().

The practical problem with this test is that it works like nothing else
in the kernel, and so does not look like what it is.  Saying:
	if (!uid_eq(p->real_cred->uid, GLOBAL_ROOT_USER)) ...

would at least be more recognizable.

Really this entire test should be if (!capable(CAP_SYS_RESOURCE) because
CAP_SYS_RESOURCE is the capability that controls if you are allowed to
exceed your rlimits.

Which brings us to the practical issues of how all of these things are
wired together today.

The per-user rlimits are accounted based upon a processes real user, not
the effective user.  All other permission checks are based upon the
effective user.  This has the practical effect that uids are swapped as
above that the processes are charged to root, but use the permissions of
an ordinary user.

The problems get worse when you realize that suid exec does not reset
any of the rlimits except for RLIMIT_STACK.

The rlimits that are particularly affected and are per-user are:
RLIMIT_NPROC, RLIMIT_MSGQUEUE, RLIMIT_SIGPENDING, RLIMIT_MEMLOCK.

But I think failing to reset rlimits during exec has the potential to
effect any suid exec.

Does anyone have any historical knowledge or sense of how this should
work?

Right now it feels like we have coded ourselves into a corner and will
have to risk breaking userspace to get out of it.  AKA I think we need
a policy of reseting rlimits on suid exec, and I think we need to store
global rlimits based upon the effective user not the real user.  Those
changes should allow making capable calls where they belong, and
removing the much too magic user == INIT_USER test for RLIMIT_NPROC.

Eric