On 2023-07-14, Aleksa Sarai <cyphar@xxxxxxxxxx> wrote: > This sysctl has the very unusal behaviour of not allowing any user (even > CAP_SYS_ADMIN) to reduce the restriction setting, meaning that if you > were to set this sysctl to a more restrictive option in the host pidns > you would need to reboot your machine in order to reset it. > > The justification given in [1] is that this is a security feature and > thus it should not be possible to disable. Aside from the fact that we > have plenty of security-related sysctls that can be disabled after being > enabled (fs.protected_symlinks for instance), the protection provided by > the sysctl is to stop users from being able to create a binary and then > execute it. A user with CAP_SYS_ADMIN can trivially do this without > memfd_create(2): > > % cat mount-memfd.c > #include <fcntl.h> > #include <string.h> > #include <stdio.h> > #include <stdlib.h> > #include <unistd.h> > #include <linux/mount.h> > > #define SHELLCODE "#!/bin/echo this file was executed from this totally private tmpfs:" > > int main(void) > { > int fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC); > assert(fsfd >= 0); > assert(!fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 2)); > > int dfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0); > assert(dfd >= 0); > > int execfd = openat(dfd, "exe", O_CREAT | O_RDWR | O_CLOEXEC, 0782); 0777 Oops. I must've garbled something when copying from my test program. > assert(execfd >= 0); > assert(write(execfd, SHELLCODE, strlen(SHELLCODE)) == strlen(SHELLCODE)); > assert(!close(execfd)); > > char *execpath = NULL; > char *argv[] = { "bad-exe", NULL }, *envp[] = { NULL }; > execfd = openat(dfd, "exe", O_PATH | O_CLOEXEC); > assert(execfd >= 0); > assert(asprintf(&execpath, "/proc/self/fd/%d", execfd) > 0); > assert(!execve(execpath, argv, envp)); > } > % ./mount-memfd > this file was executed from this totally private tmpfs: /proc/self/fd/5 > % > > Given that it is possible for CAP_SYS_ADMIN users to create executable > binaries without memfd_create(2) and without touching the host > filesystem (not to mention the many other things a CAP_SYS_ADMIN process > would be able to do that would be equivalent or worse), it seems strange > to cause a fair amount of headache to admins when there doesn't appear > to be an actual security benefit to blocking this. > > It should be noted that with this change, programs that can do an > unprivileged unshare(CLONE_NEWUSER) would be able to create an > executable memfd even if their current pidns didn't allow it. However, > the same sample program above can also be used in this scenario, meaning > that even with this consideration, blocking CAP_SYS_ADMIN makes little > sense: > > % unshare -rm ./mount-memfd > this file was executed from this totally private tmpfs: /proc/self/fd/5 > > This simply further reinforces that locked-down environments need to > disallow CLONE_NEWUSER for unprivileged users (as is already the case in > most container environments). > > [1]: https://lore.kernel.org/all/CABi2SkWnAgHK1i6iqSqPMYuNEhtHBkO8jUuCvmG3RmUB5TKHJw@xxxxxxxxxxxxxx/ > > Cc: Dominique Martinet <asmadeus@xxxxxxxxxxxxx> > Cc: Christian Brauner <brauner@xxxxxxxxxx> > Cc: stable@xxxxxxxxxxxxxxx # v6.3+ > Fixes: 105ff5339f49 ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC") > Signed-off-by: Aleksa Sarai <cyphar@xxxxxxxxxx> > --- > kernel/pid_sysctl.h | 7 ------- > 1 file changed, 7 deletions(-) > > diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h > index b26e027fc9cd..8a22bc29ebb4 100644 > --- a/kernel/pid_sysctl.h > +++ b/kernel/pid_sysctl.h > @@ -24,13 +24,6 @@ static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, > if (ns != &init_pid_ns) > table_copy.data = &ns->memfd_noexec_scope; > > - /* > - * set minimum to current value, the effect is only bigger > - * value is accepted. > - */ > - if (*(int *)table_copy.data > *(int *)table_copy.extra1) > - table_copy.extra1 = table_copy.data; > - > return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos); > } I also have a patch to properly tie the sysctl to the pid namespace rather that having a global sysctl that magically has its value changed in this pid_mfd_noexec_dointvec_minmax() and another to do the same for the other pidns-tied sysctl (kernel.ns_last_pid) but I'm not sure whether it's needed. It does make vm.memfd_noexec a bit cleaner but because the two sysctls are in different tables you can't register them together AFAICS which means a bunch of needless duplication. > > -- > 2.41.0 > -- Aleksa Sarai Senior Software Engineer (Containers) SUSE Linux GmbH <https://www.cyphar.com/>
Attachment:
signature.asc
Description: PGP signature