Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Sun, Jan 10, 2021 at 9:34 AM Alexey Gladkov <gladkov.alexey@xxxxxxxxx> wrote:
>>
>> To address the problem, we bind rlimit counters to each user namespace. The
>> result is a tree of rlimit counters with the biggest value at the root (aka
>> init_user_ns). The rlimit counter increment/decrement occurs in the current and
>> all parent user namespaces.
>
> I'm not seeing why this is necessary.
>
> Maybe it's the right approach, but none of the patches (or this cover
> letter email) really explain it to me.
>
> I understand why you might want the _limits_ themselves would form a
> tree like this - with the "master limit" limiting the limits in the
> user namespaces under it.
>
> But I don't understand why the _counts_ should do that. The 'struct
> user_struct' should be shared across even user namespaces for the same
> user.
>
> IOW, the very example of the problem you quote seems to argue against this:
>
>> For example, there are two containers (A and B) created by one user. The
>> container A sets RLIMIT_NPROC=1 and starts one process. Everything is fine, but
>> when container B tries to do the same it will fail because the number of
>> processes is counted globally for each user and user has one process already.
>
> Note how the problem was _not_ that the _count_ was global. That part
> was fine and all good.

The problem is fundamentally that the per-process RLIMIT_NPROC was
compared against user_struct->processes.

I have only heard the problem described, but I believe it is either the
RLIMIT_NPROC test in fork or the one at the beginning of
do_execveat_common that is failing.

From fs/exec.c line 1866:
>     /*
>      * We move the actual failure in case of RLIMIT_NPROC excess from
>      * set*uid() to execve() because too many poorly written programs
>      * don't check setuid() return code. Here we additionally recheck
>      * whether NPROC limit is still exceeded.
>      */
>     if ((current->flags & PF_NPROC_EXCEEDED) &&
>         atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
>             retval = -EAGAIN;
>             goto out_ret;
>     }

From kernel/fork.c line 1966:
>     retval = -EAGAIN;
>     if (atomic_read(&p->real_cred->user->processes) >=
>                     task_rlimit(p, RLIMIT_NPROC)) {
>             if (p->real_cred->user != INIT_USER &&
>                 !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>                     goto bad_fork_free;
>     }
>     current->flags &= ~PF_NPROC_EXCEEDED;

In both cases the RLIMIT_NPROC value comes from
task->signal->rlim[RLIMIT_NPROC] and the count of processes comes from
task->cred->user->processes.

> No, the problem was that the _limit_ in container A also ended up
> affecting container B.

The description I have is that both containers run the same service,
which sets its RLIMIT_NPROC to 1 in both containers.

> So to me, that says that it would make sense to continue to use the
> resource counts in 'struct user_struct' (because if user A has a hard
> limit of X, then creating a new namespace shouldn't expand that
> limit), but then have the ability to make per-container changes to the
> resource limits (as long as they are within the bounds of the parent
> user namespace resource limit).

I agree that needs to work as well.

> Maybe there is some reason for this ucounts approach, but if so, I
> feel it was not explained at all.

Let me see if I can state the example a little more clearly.

Suppose there is a service never_fork that runs as never_fork_user and
sets RLIMIT_NPROC to 1 in its systemd service file.

Further suppose there is a user bob who has two containers he wants to
run: container1 and container2. Both containers start the never_fork
service.

Bob first starts container1, and inside it the never_fork service starts.
Bob then starts container2, and the never_fork service fails to start.

Does that make it clear that the problem is the count of processes,
which would exceed 1 if both instances of the never_fork service
started?

Eric
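
For anyone who wants to poke at the never_fork example outside the
kernel, here is a minimal userspace sketch of it. Everything in it is a
hypothetical stand-in: struct user_acct plays the role of the shared
per-user counter, struct task carries the per-process RLIMIT_NPROC, and
may_start() only mimics the shape of the fork-time comparison quoted
above. It is a simplification under those assumptions, not the kernel
logic, and the second function ignores the parent-namespace accounting
described in the cover letter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared per-user accounting:
 * one process counter used by every task of the same user. */
struct user_acct {
    int processes;
};

/* Hypothetical stand-in for a task: its own RLIMIT_NPROC value plus a
 * pointer to the counter it is charged against. */
struct task {
    struct user_acct *user;
    long rlimit_nproc;
};

/* Same shape as the quoted fork check: a per-process limit is compared
 * against a count that does not belong to the process alone. */
static bool may_start(const struct task *t)
{
    return t->user->processes < t->rlimit_nproc;
}

/* Try to start one service process; false models the -EAGAIN failure. */
static bool start_service(struct task *t)
{
    assert(t != NULL && t->user != NULL);
    if (!may_start(t))
        return false;
    t->user->processes++;
    return true;
}

/* Both containers are charged against one counter for never_fork_user.
 * Returns a bitmask: bit 0 = container1 started, bit 1 = container2. */
static int run_shared_count(void)
{
    struct user_acct never_fork_user = { .processes = 0 };
    struct task container1 = { &never_fork_user, 1 };
    struct task container2 = { &never_fork_user, 1 };
    int started = 0;

    if (start_service(&container1))
        started |= 1;
    if (start_service(&container2))
        started |= 2;
    return started;
}

/* Assumption for illustration: each container gets its own counter,
 * which is roughly what per-user-namespace counts would give. */
static int run_separate_counts(void)
{
    struct user_acct in_container1 = { .processes = 0 };
    struct user_acct in_container2 = { .processes = 0 };
    struct task container1 = { &in_container1, 1 };
    struct task container2 = { &in_container2, 1 };
    int started = 0;

    if (start_service(&container1))
        started |= 1;
    if (start_service(&container2))
        started |= 2;
    return started;
}
```

Calling run_shared_count() from a small main() shows only container1's
service starting, while run_separate_counts() lets both start. The
limit is 1 in both runs; the only thing that changes is which counter
it is compared against.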