Alexey Gladkov <gladkov.alexey@xxxxxxxxx> writes: > Preface > ------- > These patches are for binding the rlimit counters to a user in user namespace. > This patch set can be applied on top of: > > git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git v5.12-rc4 > > Problem > ------- > The RLIMIT_NPROC, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE rlimits > implementation places the counters in user_struct [1]. These limits are global > between processes and persists for the lifetime of the process, even if > processes are in different user namespaces. > > To illustrate the impact of rlimits, let's say there is a program that does not > fork. Some service-A wants to run this program as user X in multiple containers. > Since the program never fork the service wants to set RLIMIT_NPROC=1. > > service-A > \- program (uid=1000, container1, rlimit_nproc=1) > \- program (uid=1000, container2, rlimit_nproc=1) > > The service-A sets RLIMIT_NPROC=1 and runs the program in container1. When the > service-A tries to run a program with RLIMIT_NPROC=1 in container2 it fails > since user X already has one running process. > > The problem is not that the limit from container1 affects container2. The > problem is that limit is verified against the global counter that reflects > the number of processes in all containers. > > This problem can be worked around by using different users for each container > but in this case we face a different problem of uid mapping when transferring > files from one container to another. > > Eric W. Biederman mentioned this issue [2][3]. > > Introduced changes > ------------------ > To address the problem, we bind rlimit counters to user namespace. Each counter > reflects the number of processes in a given uid in a given user namespace. The > result is a tree of rlimit counters with the biggest value at the root (aka > init_user_ns). The limit is considered exceeded if it's exceeded up in the tree. > > [1]: https://lore.kernel.org/containers/87imd2incs.fsf@xxxxxxxxxxxxxxxxxxxxx/ > [2]: https://lists.linuxfoundation.org/pipermail/containers/2020-August/042096.html > [3]: https://lists.linuxfoundation.org/pipermail/containers/2020-October/042524.html > > Changelog > --------- > v9: > * Used a negative value to check that the ucounts->count is close to overflow. > * Rebased onto v5.12-rc4. > > v8: > * Used atomic_t for ucounts reference counting. Also added counter overflow > check (thanks to Linus Torvalds for the idea). > * Fixed other issues found by lkp-tests project in the patch that Reimplements > RLIMIT_MEMLOCK on top of ucounts. > > v7: > * Fixed issues found by lkp-tests project in the patch that Reimplements > RLIMIT_MEMLOCK on top of ucounts. > > v6: > * Fixed issues found by lkp-tests project. > * Rebased onto v5.11. > > v5: > * Split the first commit into two commits: change ucounts.count type to atomic_long_t > and add ucounts to cred. These commits were merged by mistake during the rebase. > * The __get_ucounts() renamed to alloc_ucounts(). > * The cred.ucounts update has been moved from commit_creds() as it did not allow > to handle errors. > * Added error handling of set_cred_ucounts(). > > v4: > * Reverted the type change of ucounts.count to refcount_t. > * Fixed typo in the kernel/cred.c > > v3: > * Added get_ucounts() function to increase the reference count. The existing > get_counts() function renamed to __get_ucounts(). > * The type of ucounts.count changed from atomic_t to refcount_t. > * Dropped 'const' from set_cred_ucounts() arguments. > * Fixed a bug with freeing the cred structure after calling cred_alloc_blank(). > * Commit messages have been updated. > * Added selftest. > > v2: > * RLIMIT_MEMLOCK, RLIMIT_SIGPENDING and RLIMIT_MSGQUEUE are migrated to ucounts. > * Added ucounts for pair uid and user namespace into cred. > * Added the ability to increase ucount by more than 1. > > v1: > * After discussion with Eric W. Biederman, I increased the size of ucounts to > atomic_long_t. > * Added ucount_max to avoid the fork bomb. > > -- > > Alexey Gladkov (8): > Increase size of ucounts to atomic_long_t > Add a reference to ucounts for each cred > Use atomic_t for ucounts reference counting > Reimplement RLIMIT_NPROC on top of ucounts > Reimplement RLIMIT_MSGQUEUE on top of ucounts > Reimplement RLIMIT_SIGPENDING on top of ucounts > Reimplement RLIMIT_MEMLOCK on top of ucounts > kselftests: Add test to check for rlimit changes in different user namespaces > Overall this looks good, and it is a very good sign that the automatic testing bots have not found anything. I found a few little nits. But thing are looking very good. Eric