On Wed, May 27, 2020 at 06:05:55PM +0000, Nicolas Viennot wrote: > > > Also in this thread Kamil mentioned that they also need calling prctl > > > with PR_SET_MM during restore in their production setup. > > > > We're using that as well but it really feels like this: > > > > prctl_map = (struct prctl_mm_map){ > > .start_code = start_code, > > .end_code = end_code, > > .start_stack = start_stack, > > .start_data = start_data, > > .end_data = end_data, > > .start_brk = start_brk, > > .brk = brk_val, > > .arg_start = arg_start, > > .arg_end = arg_end, > > .env_start = env_start, > > .env_end = env_end, > > .auxv = NULL, > > .auxv_size = 0, > > .exe_fd = -1, > > }; > > > > should belong under ns_capable(CAP_SYS_ADMIN). Why is that necessary to relax? > > When the prctl(PR_SET_MM_MAP...), the only privileged operation is to change the symlink of /proc/self/exe via set_mm_exe_file(). > See https://github.com/torvalds/linux/blob/444fc5cde64330661bf59944c43844e7d4c2ccd8/kernel/sys.c#L2001-L2004 > It needs CAP_SYS_ADMIN of the current namespace. This already has been relaxed before (see commit below) and why I'm at least symbolically pushing back is that I'm getting worried that we're just removing restrictions left and right and making kernel interfaces available to fully unprivileged user that very much seem to belong into the realm of local cap_sys_admin. prctl: Allow local CAP_SYS_ADMIN changing exe_file During checkpointing and restore of userspace tasks we bumped into the situation, that it's not possible to restore the tasks, which user namespace does not have uid 0 or gid 0 mapped. People create user namespace mappings like they want, and there is no a limitation on obligatory uid and gid "must be mapped". So, if there is no uid 0 or gid 0 in the mapping, it's impossible to restore mm->exe_file of the processes belonging to this user namespace. Also, there is no a workaround. It's impossible to create a temporary uid/gid mapping, because only one write to /proc/[pid]/uid_map and gid_map is allowed during a namespace lifetime. If there is an entry, then no more mapings can't be written. If there isn't an entry, we can't write there too, otherwise user task won't be able to do that in the future. The patch changes the check, and looks for CAP_SYS_ADMIN instead of zero uid and gid. This allows to restore a task independently of its user namespace mappings. > > I would argue that setting the current process exe file check should just be reduced to a "can you ptrace a children" check. > Here's why: any process can masquerade into another executable with ptrace. > One can fork a child, ptrace it, have the child execve("target_exe"), then replace its memory content with an arbitrary program. Then it should probably be relaxed to CAP_SYS_PTRACE in the user namespace and not CAP_CHECKPOINT_RESTORE. (But apparently you also have a way of achieving what you want anyway. Imho, it's not necessarily wrong to require a bit more work when you want something like fully unprivileged c/r that's a rather special interest group.) > With CRIU's libcompel parasite mechanism (https://criu.org/Compel) this is fairly easy to implement. > In fact, we could modify CRIU to do just that (but with a fair amount of efforts due to the way CRIU is written), > and not rely on being able to SET_MM_EXE_FILE via prctl(). In turn, that would give an easy way to masquerade any process > into another one, provided that one can ptrace a child. > > When not using PR_SET_MM_MAP, but using SET_MM_EXE_FILE, the CAP_RESOURCES at the root namespace level is required: > https://github.com/torvalds/linux/blob/444fc5cde64330661bf59944c43844e7d4c2ccd8/kernel/sys.c#L2109 > This seems inconsistent. Also for some reason changing auxv is not privileged if using prctl via the MM_MAP mechanism, but is privileged otherwise. Fwiw, it always helps if people take the time to dig through the history of specifc changes. That usually helps explain why things ended up as confusing as they are now: commit f606b77f1a9e362451aca8f81d8f36a3a112139e Author: Cyrill Gorcunov <gorcunov@xxxxxxxxxx> Date: Thu Oct 9 15:27:37 2014 -0700 prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation During development of c/r we've noticed that in case if we need to support user namespaces we face a problem with capabilities in prctl(PR_SET_MM, ...) call, in particular once new user namespace is created capable(CAP_SYS_RESOURCE) no longer passes. A approach is to eliminate CAP_SYS_RESOURCE check but pass all new values in one bundle, which would allow the kernel to make more intensive test [snip] Still note that updating exe-file link now doesn't require sys-resource capability anymore, after all there is no much profit in preventing setup own file link (there are a number of ways to execute own code -- ptrace, ld-preload, so that the only reliable way to find which exactly code is executed is to inspect running program memory). Still we require the caller to be at least user-namespace root user. I believe the old interface should be deprecated and ripped off in a couple of kernel releases if no one against. Sine you already have an interface that works and the argument for the new interface is that it let's the kernel do better validation if all arguments be passed at once I don't see the point in removing the CAP_SYS_RESOURCE restricting from the old interface possible introducing more maintenance burden or bugs when changing this. It would also encourage users to use an interface that even c/r people seemed to have viewed as being deprecated. Christian