Hello. Thank you for your email and interested in using memfd_noexec ! On Wed, Jun 28, 2023 at 4:43 AM Dominique Martinet <asmadeus@xxxxxxxxxxxxx> wrote: > > jeffxu@xxxxxxxxxxxx wrote on Thu, Dec 15, 2022 at 12:12:03AM +0000: > > From: Jeff Xu <jeffxu@xxxxxxxxxx> > > > > The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to > > set executable bit at creation time (memfd_create). > > > > When MFD_NOEXEC_SEAL is set, memfd is created without executable bit > > (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to > > be executable (mode: 0777) after creation. > > > > when MFD_EXEC flag is set, memfd is created with executable bit > > (mode:0777), this is the same as the old behavior of memfd_create. > > > > The new pid namespaced sysctl vm.memfd_noexec has 3 values: > > 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like > > MFD_EXEC was set. > > 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like > > MFD_NOEXEC_SEAL was set. > > 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected. > > So, erm, I'm a bit late to the party but I was just looking at a way of > blocking memfd_create+exec in a container and this sounded perfect: my > reading is that this is a security feature meant to be set for > container's namespaces that'd totally disable something like > memfd_create followed by fexecve (because we don't want weird binaries > coming from who knows where to be executed on a shiny secure system), > but. . . is this actually supposed to work? > (see below) > > > [...] > > --- a/mm/memfd.c > > +++ b/mm/memfd.c > > @@ -263,12 +264,14 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) > > #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) > > #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) > > > > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) > > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC) > > > > SYSCALL_DEFINE2(memfd_create, > > const char __user *, uname, > > unsigned int, flags) > > { > > + char comm[TASK_COMM_LEN]; > > + struct pid_namespace *ns; > > unsigned int *file_seals; > > struct file *file; > > int fd, error; > > @@ -285,6 +288,39 @@ SYSCALL_DEFINE2(memfd_create, > > return -EINVAL; > > } > > > > + /* Invalid if both EXEC and NOEXEC_SEAL are set.*/ > > + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL)) > > + return -EINVAL; > > + > > + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { > > + [code that checks the sysctl] > > If flags already has either MFD_EXEC or MFD_NOEXEC_SEAL, you don't check > the sysctl at all. > > This can be verified easily: > ----- > $ cat > memfd_exec.c <<'EOF' > #define _GNU_SOURCE > > #include <errno.h> > #include <stdio.h> > #include <sys/mman.h> > #include <sys/types.h> > #include <sys/wait.h> > > #ifndef MFD_EXEC > #define MFD_EXEC 0x0010U > #endif > > int main() { > int fd = memfd_create("script", MFD_EXEC); > if (fd == -1)l > perror("memfd"); > > char prog[] = "#!/bin/sh\necho Ran script\n"; > if (write(fd, prog, sizeof(prog)-1) != sizeof(prog)-1) > perror("write"); > > char *const argv[] = { "script", NULL }; > char *const envp[] = { NULL }; > fexecve(fd, argv, envp); > perror("fexecve"); > } > EOF > $ gcc -o memfd_exec memfd_exec.c > $ ./memfd_exec > Ran script > $ sysctl vm.memfd_noexec > vm.memfd_noexec = 2 > ----- > (as opposed to failing hard on memfd_create if flag unset on sysctl=2, > and failing on fexecve with flag unset and sysctl=1) > > What am I missing? > > At one point, I was thinking of having a security hook to block executable memfd [1], so this sysctl only works for the application that doesn't set EXEC bit. Now I think it makes sense to use vm.memfd_noexec = 2 to block the MFD_EXEC also. Anyway the commit msg says: 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected. Not doing that is a bug. I will send a fix for that. [1] https://lore.kernel.org/lkml/20221206150233.1963717-7-jeffxu@xxxxxxxxxx/ > > BTW I find the current behaviour rather hard to use: setting this to 2 > should still set NOEXEC by default in my opinion, just refuse anything > that explicitly requested EXEC. > At one point [2] (v2 of patch) there were two sysctls, one is doing overwrite, one is enforcing, later I decided with one sysctl, the rationale is the kernel will eventually get out of the business of overwriting user space code. Yes. It might take a long time to migrate all of the userspace. In the meantime, to meet what you want, the solution is keep vm.memfd_noexec = 1 (for overwrite), and a new security policy (SELInux or Landlock) that uses security hook security_memfd_create, this can block one process from creating executable memfd. Indeed, security policy is better fit to cases like this than sysctl. [2] https://lore.kernel.org/linux-mm/CABi2SkWGo9Jrd=i1e2PoDWYGenGhR=pG=yGsQP5VLmizTmg-iA@xxxxxxxxxxxxxx/ > Sure there's a warn_once that memfd_create was used without seal, but > right now on my system it's "used up" 5 seconds after boot by systemd: > [ 5.854378] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd' > > And anyway, older kernels will barf up EINVAL when calling memfd_create > with MFD_NOEXEC_SEAL, so even if userspace will want to adapt they'll > need to try calling memfd_create with the flag once and retry on EINVAL, > which let's face it is going to take a while to happen. > (Also, the flag has been added to glibc, but not in any release yet) > Yes. Application will need to do some detection of the kernel. This is not avoidable. > Making calls default to noexec AND refuse exec does what you want > (forbid use of exec in an app that wasn't in a namespace that allows > exec) while allowing apps that require it to work; that sounds better > than making all applications that haven't taken the pain of adding the > new flag to me. > Well, I guess an app that did require exec without setting the flag will > fail in a weird place instead of failing at memfd_create and having a > chance to fallback, so it's not like it doesn't make any sense; > I don't have such strong feelings about this if the sysctl works, but > for my use case I'm more likely to want to take a chance at memfd_create > not needing exec than having the flag set. Perhaps a third value if I > cared enough... > > -- > Dominique Martinet | Asmadeus Thanks -Jeff