On Fri, May 22, 2020 at 09:40:37AM -0700, Casey Schaufler wrote: > On 5/21/2020 10:53 PM, Adrian Reber wrote: > > This enables CRIU to checkpoint and restore a process as non-root. > > I know it sounds pedantic, but could you spell out CRIU once? > While I know that everyone who cares either knows or can guess > what you're talking about, it may be a mystery to some of the > newer kernel developers. Sure. CRIU - Checkpoint/Restore In Userspace. > > Over the last years CRIU upstream has been asked a couple of time if it > > is possible to checkpoint and restore a process as non-root. The answer > > usually was: 'almost'. > > > > The main blocker to restore a process was that selecting the PID of the > > restored process, which is necessary for CRIU, is guarded by CAP_SYS_ADMIN. > > What are the other blockers? Are you going to suggest additional new > capabilities to clear them? As mentioned somewhere else access to /proc/<pid>/map_files/ would be helpful. Right now I am testing with a JVM and it works without root just with the attached patch. Without access to /proc/<pid>/map_files/ not everything CRIU can do will actually work, but we are a lot closer to what our users have been asking for. > > In the last two years the questions about checkpoint/restore as non-root > > have increased and especially in the last few months we have seen > > multiple people inventing workarounds. > > Giving a process CAP_SYS_ADMIN is a non-root solution. Yes, but like mentioned somewhere else not a solution that actually works, because CAP_SYS_ADMIN allows too much. Especially for the checkpoint/restore case, we really need one (setting the PID of a new process) and to make it complete a second (reading map_files). Reading the comments in include/uapi/linux/capability.h concerning CAP_SYS_ADMIN it allows the binary to do at least 35 things. The two (three) I mentioned above (ns_last_pid (clone3) map_files) are not mentioned in that list, so CAP_SYS_ADMIN allows probably much more. To allow checkpoint/restore as non-root nobody will give CRIU CAP_SYS_ADMIN because it is too wide. > > The use-cases so far and their workarounds: > > > > * Checkpoint/Restore in an HPC environment in combination with > > a resource manager distributing jobs. Users are always running > > as non root, but there was the desire to provide a way to > > checkpoint and restore long running jobs. > > Workaround: setuid wrapper to start CRIU as root as non-root > > https://github.com/FredHutch/slurm-examples/blob/master/checkpointer/lib/checkpointer/checkpointer-suid.c > > This is a classic and well understood mechanism for dealing with > this kind of situation. You could have checkpointer-filecap-sys_admin.c > instead, if you want to reduce use of the super-user. > > > * Another use case to checkpoint/restore processes as non-root > > uses as workaround a non privileged process which cycles through > > PIDs by calling fork() as fast as possible with a rate of > > 100,000 pids/s instead of writing to ns_last_pid > > https://github.com/twosigma/set_ns_last_pid > > Oh dear. > > > * Fast Java startup using checkpoint/restore. > > We have been in contact with JVM developers who are integrating > > CRIU into a JVM to decrease the startup time. > > Workaround so far: patch out CAP_SYS_ADMIN checks in the kernel > > That's not a workaround, it's a policy violation. > Bad JVM! No biscuit! This was used as a proof of concept to see if we can checkpoint and restore a JVM without root. Only the ns_last_pid check was removed to see if it works and it does. > > * Container migration as non root. There are people already > > using CRIU to migrate containers as non-root. The solution > > there is to run it in a user namespace. So if you are able > > to carefully setup your environment with the namespaces > > it is already possible to restore a container/process as non-root. > > This is exactly the kind of situation that user namespaces are > supposed to address. > > > Unfortunately it is not always possible to setup an environment > > in such a way and for easier access to non-root based container > > migration this patch is also required. > > If a user namespace solution is impossible or (more likely) too > expensive, there's always the checkpointer-filecap-sys_admin option. But then again we open up all of CAP_SYS_ADMIN, which is not necessary. > > There are probably a few more things guarded by CAP_SYS_ADMIN required > > to run checkpoint/restore as non-root, > > If you need CAP_SYS_ADMIN anyway you're not gaining anything by > separating out CAP_RESTORE. No, as described we can checkpoint and restore a JVM with this patch and it also solves the problem the set_ns_last_pid fork() loop daemon tries to solve. It is not enough to support the full functionality of CRIU as map_files is also important, but we do not need CAP_SYS_ADMIN and CAP_RESTORE. Only CAP_RESTORE would be necessary. With a new capability users can enable checkpoint/restore as non-root without giving CRIU access to any of the other possibilities offered by CAP_SYS_ADMIN. Setting a PID and map_files have been introduced for CRIU and used to live behind CONFIG_CHECKPOINT_RESTORE. Having a capability for checkpoint/restore would make it easier for CRIU users to run it as non-root and make it very clear what is possible when giving CRIU the new capability. No other things would be allowed than necessary for checkpoint/restore. Setting a PID is most important for the restore part and reading map_files would be helpful during checkpoint. So it actually should be called CAP_CHECKPOINT_RESTORE as Christian mentioned in another email. > > but by applying this patch I can > > already checkpoint and restore processes as non-root. As there are > > already multiple workarounds I would prefer to do it correctly in the > > kernel to avoid that CRIU users are starting to invent more workarounds. > > You've presented a couple of really inappropriate implementations > that would qualify as workarounds. But the other two are completely > appropriate within the system security policy. They don't "get around" > the problem, they use existing mechanisms as they are intended. I agree with the user namespace approach to be appropriate, but not the CAP_SYS_ADMIN approach as CRIU only needs a tiny subset (2 things) of what CAP_SYS_ADMIN allows. > > I have used the following tests to verify that this change works as > > expected by setting the new capability CAP_RESTORE on the two resulting > > test binaries: > > > > $ cat ns_last_pid.c > > // http://efiop-notes.blogspot.com/2014/06/how-to-set-pid-using-nslastpid.html > > #include <stdio.h> > > #include <stdlib.h> > > #include <string.h> > > #include <sys/file.h> > > #include <sys/types.h> > > #include <unistd.h> > > > > int main(int argc, char *argv[]) > > { > > pid_t pid, new_pid; > > char buf[32]; > > int fd; > > > > if (argc != 2) > > return 1; > > > > printf("Opening ns_last_pid...\n"); > > fd = open("/proc/sys/kernel/ns_last_pid", O_RDWR | O_CREAT, 0644); > > if (fd < 0) { > > perror("Cannot open ns_last_pid"); > > return 1; > > } > > > > printf("Locking ns_last_pid...\n"); > > if (flock(fd, LOCK_EX)) { > > close(fd); > > printf("Cannot lock ns_last_pid\n"); > > return 1; > > } > > > > pid = atoi(argv[1]); > > snprintf(buf, sizeof(buf), "%d", pid - 1); > > printf("Writing pid-1 to ns_last_pid...\n"); > > if (write(fd, buf, strlen(buf)) != strlen(buf)) { > > printf("Cannot write to buf\n"); > > return 1; > > } > > > > printf("Forking...\n"); > > new_pid = fork(); > > if (new_pid == 0) { > > printf("I am the child!\n"); > > exit(0); > > } else if (new_pid == pid) > > printf("I am the parent. My child got the pid %d!\n", new_pid); > > else > > printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid); > > > > printf("Cleaning up...\n"); > > if (flock(fd, LOCK_UN)) > > printf("Cannot unlock\n"); > > close(fd); > > return 0; > > } > > $ id -u; /home/libcap/ns_last_pid 300000 > > 1001 > > Opening ns_last_pid... > > Locking ns_last_pid... > > Writing pid-1 to ns_last_pid... > > Forking... > > I am the parent. My child got the pid 300000! > > I am the child! > > Cleaning up... > > > > For the clone3() based approach: > > $ cat clone3_set_tid.c > > #define _GNU_SOURCE > > #include <linux/sched.h> > > #include <stdint.h> > > #include <stdio.h> > > #include <stdlib.h> > > #include <string.h> > > #include <sys/types.h> > > #include <sys/stat.h> > > #include <sys/syscall.h> > > #include <unistd.h> > > > > #define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr))) > > > > int main(int argc, char *argv[]) > > { > > struct clone_args c_args = { }; > > pid_t pid, new_pid; > > > > if (argc != 2) > > return 1; > > > > pid = atoi(argv[1]); > > c_args.set_tid = ptr_to_u64(&pid); > > c_args.set_tid_size = 1; > > > > printf("Forking...\n"); > > new_pid = syscall(__NR_clone3, &c_args, sizeof(c_args)); > > if (new_pid == 0) { > > printf("I am the child!\n"); > > exit(0); > > } else if (new_pid == pid) > > printf("I am the parent. My child got the pid %d!\n", new_pid); > > else > > printf("pid (%d) does not match expected pid (%d)\n", new_pid, pid); > > printf("Done\n"); > > > > return 0; > > } > > $ id -u; /home/libcap/clone3_set_tid 300000 > > 1001 > > Forking... > > I am the parent. My child got the pid 300000! > > Done > > I am the child! > > > > Signed-off-by: Adrian Reber <areber@xxxxxxxxxx> > > --- > > include/linux/capability.h | 5 +++++ > > include/uapi/linux/capability.h | 9 ++++++++- > > kernel/pid.c | 2 +- > > kernel/pid_namespace.c | 2 +- > > security/selinux/include/classmap.h | 5 +++-- > > 5 files changed, 18 insertions(+), 5 deletions(-) > > > > diff --git a/include/linux/capability.h b/include/linux/capability.h > > index b4345b38a6be..1278313cb2bc 100644 > > --- a/include/linux/capability.h > > +++ b/include/linux/capability.h > > @@ -261,6 +261,11 @@ static inline bool bpf_capable(void) > > return capable(CAP_BPF) || capable(CAP_SYS_ADMIN); > > } > > > > +static inline bool restore_ns_capable(struct user_namespace *ns) > > +{ > > + return ns_capable(ns, CAP_RESTORE) || ns_capable(ns, CAP_SYS_ADMIN); > > +} > > + > > /* audit system wants to get cap info from files as well */ > > extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps); > > > > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h > > index c7372180a0a9..4bcc4e3d41ff 100644 > > --- a/include/uapi/linux/capability.h > > +++ b/include/uapi/linux/capability.h > > @@ -406,7 +406,14 @@ struct vfs_ns_cap_data { > > */ > > #define CAP_BPF 39 > > > > -#define CAP_LAST_CAP CAP_BPF > > + > > +/* Allow checkpoint/restore related operations */ > > +/* Allow PID selection during clone3() */ > > +/* Allow writing to ns_last_pid */ > > + > > +#define CAP_RESTORE 40 > > + > > +#define CAP_LAST_CAP CAP_RESTORE > > > > #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP) > > > > diff --git a/kernel/pid.c b/kernel/pid.c > > index 3122043fe364..bbc26f2bcff6 100644 > > --- a/kernel/pid.c > > +++ b/kernel/pid.c > > @@ -198,7 +198,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, > > if (tid != 1 && !tmp->child_reaper) > > goto out_free; > > retval = -EPERM; > > - if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN)) > > + if (!restore_ns_capable(tmp->user_ns)) > > goto out_free; > > set_tid_size--; > > } > > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > > index 0e5ac162c3a8..f58186b31ce6 100644 > > --- a/kernel/pid_namespace.c > > +++ b/kernel/pid_namespace.c > > @@ -269,7 +269,7 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write, > > struct ctl_table tmp = *table; > > int ret, next; > > > > - if (write && !ns_capable(pid_ns->user_ns, CAP_SYS_ADMIN)) > > + if (write && !restore_ns_capable(pid_ns->user_ns)) > > return -EPERM; > > > > /* > > diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h > > index 98e1513b608a..f8b8f12a6ebd 100644 > > --- a/security/selinux/include/classmap.h > > +++ b/security/selinux/include/classmap.h > > @@ -27,9 +27,10 @@ > > "audit_control", "setfcap" > > > > #define COMMON_CAP2_PERMS "mac_override", "mac_admin", "syslog", \ > > - "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf" > > + "wake_alarm", "block_suspend", "audit_read", "perfmon", "bpf", \ > > + "restore" > > > > -#if CAP_LAST_CAP > CAP_BPF > > +#if CAP_LAST_CAP > CAP_RESTORE > > #error New capability defined, please update COMMON_CAP2_PERMS. > > #endif > > > > > > base-commit: e8f3274774b45b5f4e9e3d5cad7ff9f43ae3add5 >