On Mon, Jan 23, 2023 at 12:45:50PM +0100, David Hildenbrand wrote:
> On 19.01.23 17:03, Joey Gouly wrote:
> > The aim of such policy is to prevent a user task from creating an
> > executable mapping that is also writeable.
> >
> > An example of mmap() returning -EACCESS if the policy is enabled:
> >
> > 	mmap(0, size, PROT_READ | PROT_WRITE | PROT_EXEC, flags, 0, 0);
> >
> > Similarly, mprotect() would return -EACCESS below:
> >
> > 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> > 	mprotect(addr, size, PROT_READ | PROT_WRITE | PROT_EXEC);
> >
> > The BPF filter that systemd MDWE uses is stateless, and disallows
> > mprotect() with PROT_EXEC completely. This new prctl allows PROT_EXEC to
> > be enabled if it was already PROT_EXEC, which allows the following case:
> >
> > 	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
> > 	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);
> >
> > where PROT_BTI enables branch tracking identification on arm64.
> >
> > Signed-off-by: Joey Gouly <joey.gouly@xxxxxxx>
> > Co-developed-by: Catalin Marinas <catalin.marinas@xxxxxxx>
> > Signed-off-by: Catalin Marinas <catalin.marinas@xxxxxxx>
> > Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > ---
> >  include/linux/mman.h           | 34 ++++++++++++++++++++++++++++++++++
> >  include/linux/sched/coredump.h |  6 +++++-
> >  include/uapi/linux/prctl.h     |  6 ++++++
> >  kernel/sys.c                   | 33 +++++++++++++++++++++++++++++++++
> >  mm/mmap.c                      | 10 ++++++++++
> >  mm/mprotect.c                  |  5 +++++
> >  6 files changed, 93 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > index 58b3abd457a3..cee1e4b566d8 100644
> > --- a/include/linux/mman.h
> > +++ b/include/linux/mman.h
> > @@ -156,4 +156,38 @@ calc_vm_flag_bits(unsigned long flags)
> >  }
> >  unsigned long vm_commit_limit(void);
> > +
> > +/*
> > + * Denies creating a writable executable mapping or gaining executable permissions.
> > + *
> > + * This denies the following:
> > + *
> > + *	a)	mmap(PROT_WRITE | PROT_EXEC)
> > + *
> > + *	b)	mmap(PROT_WRITE)
> > + *		mprotect(PROT_EXEC)
> > + *
> > + *	c)	mmap(PROT_WRITE)
> > + *		mprotect(PROT_READ)
> > + *		mprotect(PROT_EXEC)
> > + *
> > + * But allows the following:
> > + *
> > + *	d)	mmap(PROT_READ | PROT_EXEC)
> > + *		mmap(PROT_READ | PROT_EXEC | PROT_BTI)
> > + */
>
> Shouldn't we clear VM_MAYEXEC at mmap() time such that we cannot set VM_EXEC
> anymore? In an ideal world, there would be no further mprotect changes
> required.

I don't think it works for this scenario. We don't want to disable
PROT_EXEC entirely, only disallow it if the mapping is not already
executable. The below should be allowed:

	addr = mmap(0, size, PROT_READ | PROT_EXEC, flags, 0, 0);
	mprotect(addr, size, PROT_READ | PROT_EXEC | PROT_BTI);

but IIUC what you meant, it fails if we cleared VM_MAYEXEC at mmap()
time.

We could clear VM_MAYEXEC if the mapping was made VM_WRITE (either by
mmap() or mprotect()) but IIRC we concluded that this should be an
additional prctl() flag. This series aims to be pretty much a drop-in
replacement for the systemd's MDWE SECCOMP feature (but allowing
PROT_BTI).

-- 
Catalin
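
For illustration only, the rule discussed in this thread (refuse write+exec,
and refuse gaining exec on a mapping that is not already executable) can be
modelled in userspace roughly as below. This is a minimal sketch and not the
code from the patch: the MODEL_VM_* constants and the model_*() helpers are
hypothetical names, the bit values are arbitrary, and treating a fresh mmap()
as "old flags == new flags" is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's VM_* bits; values are arbitrary. */
#define MODEL_VM_READ	0x1UL
#define MODEL_VM_WRITE	0x2UL
#define MODEL_VM_EXEC	0x4UL
#define MODEL_VM_BTI	0x8UL

/* Returns true if changing a mapping from old_flags to new_flags is refused. */
static bool model_mprotect_denied(unsigned long old_flags,
				  unsigned long new_flags)
{
	/* Writable and executable at the same time is always refused. */
	if ((new_flags & MODEL_VM_EXEC) && (new_flags & MODEL_VM_WRITE))
		return true;

	/* Gaining exec on a mapping that is not already exec is refused. */
	if (!(old_flags & MODEL_VM_EXEC) && (new_flags & MODEL_VM_EXEC))
		return true;

	return false;
}

/* Model assumption: a fresh mmap() has nothing to "gain", so old == new. */
static bool model_mmap_denied(unsigned long flags)
{
	return model_mprotect_denied(flags, flags);
}

/* Walks through cases a) to d) from the quoted comment. */
static void model_selfcheck(void)
{
	const unsigned long r = MODEL_VM_READ, w = MODEL_VM_WRITE;
	const unsigned long x = MODEL_VM_EXEC, bti = MODEL_VM_BTI;

	assert(model_mmap_denied(w | x));		/* a) */
	assert(model_mprotect_denied(w, x));		/* b) */
	assert(model_mprotect_denied(r, x));		/* c) final step */
	assert(!model_mmap_denied(r | x));		/* d) */
	assert(!model_mprotect_denied(r | x, r | x | bti)); /* d) + BTI */
}
```

The last assertion is the PROT_BTI case: exec is kept rather than gained, so
the check is stateful in the old flags, which a stateless filter or a
VM_MAYEXEC bit cleared at mmap() time cannot express.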