On 20.4.2022 16.01, Catalin Marinas wrote:
On Thu, Apr 14, 2022 at 11:52:17AM -0700, Kees Cook wrote:
On Wed, Apr 13, 2022 at 02:49:42PM +0100, Catalin Marinas wrote:
The background to this is that systemd has a configuration option called
MemoryDenyWriteExecute [1], implemented as a SECCOMP BPF filter. Its aim
is to prevent a user task from inadvertently creating an executable
mapping that is (or was) writeable. Since such a BPF filter is stateless,
it cannot detect mappings that were previously writeable but
subsequently changed to read-only. Therefore the filter simply rejects
any mprotect(PROT_EXEC). The side-effect is that on arm64 with BTI
support (Branch Target Identification), the dynamic loader cannot change
an ELF section from PROT_EXEC to PROT_EXEC|PROT_BTI using mprotect().
For libraries, it can resort to unmapping and re-mapping, but for the
main executable it does not have a file descriptor. See the original bug
report in the Red Hat bugzilla [2] and the subsequent glibc workaround
for libraries [3].
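To make the failure mode concrete, here is a minimal userspace model of the
decision the stateless filter makes, as described above. It is only a sketch:
filter_allows_mprotect() is a hypothetical name, not systemd's actual BPF
program, and PROT_BTI is defined locally with the arm64 uapi value so that
the snippet also builds on other architectures.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* arm64 uapi value; defined here so the sketch builds on other architectures. */
#ifndef PROT_BTI
#define PROT_BTI 0x10
#endif

/*
 * Hypothetical model of the stateless filter: it sees only the requested
 * protection bits, not what the mapping currently has, so every request
 * that contains PROT_EXEC is refused.
 */
bool filter_allows_mprotect(int prot)
{
	return !(prot & PROT_EXEC);
}
```

With this model, filter_allows_mprotect(PROT_READ | PROT_EXEC | PROT_BTI)
is false even when the mapping is already executable, which is the case the
dynamic loader hits on the main executable.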
Right, so, the systemd filter is a big hammer solution for the kernel
not having a very easy way to provide W^X mapping protections to
userspace. There's stuff in SELinux, and there have been several
attempts[1] at other LSMs to do it too, but nothing stuck.
Given the filter, and the implementation of how to enable BTI, I see two
solutions:
- provide a way to do W^X so systemd can implement the feature differently
- provide a way to turn on BTI separate from mprotect to bypass the filter
I would agree, the latter seems like the greater hack,
We discussed such hacks in the past but they are just working around the
fundamental issue - systemd wants W^X but with BPF it can only achieve
it by preventing mprotect(PROT_EXEC) irrespective of whether the mapping
was already executable. If we find a better solution for W^X, we
wouldn't have to hack anything for mprotect(PROT_EXEC|PROT_BTI).
so I welcome
this RFC, though I think it might need to explore a bit of the feature
space exposed by other solutions[1] (i.e. see SARA and NAX), otherwise
it risks being too narrowly implemented. For example, playing well with
JITs should be part of the design, and will likely need some kind of
ELF flags and/or "sealing" mode, and to handle the vma alias case as
Jann Horn pointed out[2].
I agree we should look at what we want to cover, though trying to avoid
re-inventing SELinux. With this patchset I went for the minimum that
systemd MDWE does with BPF.
I think JITs get around it using something like memfd with two separate
mappings to the same page. We could try to prevent such aliases but
allow it if an ELF note is detected (or get the JIT to issue a prctl()).
Anyway, with a prctl() we can allow finer-grained control starting with
anonymous and file mappings and later extending to vma aliases,
writeable files etc. On top we can add a seal mask so that a process
cannot disable a control that was set. Something like (I'm not good at
names):
	prctl(PR_MDWX_SET, flags, seal_mask);
	prctl(PR_MDWX_GET);
with flags like:
	PR_MDWX_MMAP - basics, should cover mmap() and mprotect()
	PR_MDWX_ALIAS - vma aliases, allowed with an ELF note
	PR_MDWX_WRITEABLE_FILE
(needs some more thinking)
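One possible reading of the flags/seal_mask pair, modelled in plain userspace
C. Everything here is an assumption for illustration: the bit values, the
struct mdwx_state type and the mdwx_set() helper are hypothetical, and the
rule "a sealed bit can no longer change in either direction" is just one way
to interpret the seal mask.

```c
/* Hypothetical bit values; the proposal above does not assign any. */
#define PR_MDWX_MMAP           (1UL << 0)
#define PR_MDWX_ALIAS          (1UL << 1)
#define PR_MDWX_WRITEABLE_FILE (1UL << 2)

/* Per-process state that a PR_MDWX_SET call would update. */
struct mdwx_state {
	unsigned long flags;   /* controls currently enabled */
	unsigned long sealed;  /* bits that may no longer change */
};

/*
 * Model of prctl(PR_MDWX_SET, flags, seal_mask).
 * Returns 0 on success, -1 if the request would change a sealed bit
 * (the real call would presumably fail with EPERM).
 */
int mdwx_set(struct mdwx_state *s, unsigned long flags,
	     unsigned long seal_mask)
{
	/* Any difference between old and new flags on a sealed bit is refused. */
	if ((s->flags ^ flags) & s->sealed)
		return -1;

	s->flags = flags;
	s->sealed |= seal_mask;  /* seals only accumulate */
	return 0;
}
```

Under this reading, a service manager would set and seal PR_MDWX_MMAP before
exec, after which the service can still opt in to further unsealed controls
but cannot drop the sealed one.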
For systemd, feature compatibility with the BPF version is important so
that we could automatically switch to the kernel version once available
without regressions. So I think PR_MDWX_MMAP (or maybe PR_MDWX_COMPAT)
should match exactly what MemoryDenyWriteExecute=yes as implemented with
BPF has: only forbid mmap(PROT_EXEC|PROT_WRITE) and mprotect(PROT_EXEC).
Like BPF, once installed there should be no way to escape and ELF flags
should also be ignored. ARM BTI should be allowed though (allow
PROT_EXEC|PROT_BTI if the old flags had PROT_EXEC).
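As a rough sketch of what I mean by the compat rules, written as plain
userspace predicates. The function names are made up, old_prot stands for the
current protection of the vma (which the kernel knows but the BPF filter does
not), and refusing PROT_WRITE in the BTI exception is my assumption rather
than anything decided.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* arm64 uapi value; defined here so the sketch builds on other architectures. */
#ifndef PROT_BTI
#define PROT_BTI 0x10
#endif

/* mmap(): only a simultaneous write + exec request is refused. */
bool compat_allows_mmap(int prot)
{
	return (prot & (PROT_WRITE | PROT_EXEC)) != (PROT_WRITE | PROT_EXEC);
}

/* mprotect(): refuse PROT_EXEC, except for the BTI upgrade case. */
bool compat_allows_mprotect(int old_prot, int new_prot)
{
	/* Requests without PROT_EXEC are never affected. */
	if (!(new_prot & PROT_EXEC))
		return true;

	/*
	 * BTI exception: the mapping is already executable and the request
	 * adds PROT_BTI. Assumption: it must not ask for PROT_WRITE as well.
	 */
	return (old_prot & PROT_EXEC) && (new_prot & PROT_BTI) &&
	       !(new_prot & PROT_WRITE);
}
```

So compat_allows_mprotect(PROT_READ | PROT_EXEC, PROT_READ | PROT_EXEC |
PROT_BTI) passes, while turning a read-only or writable mapping executable
is still refused, same as with the BPF filter today.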
Then we could have improved versions (other PR_MDWX_ prctls) with lots
more checks. This could be enabled with MemoryDenyWriteExecute=strict or so.
Perhaps also more relaxed versions (like SARA) could be interesting
(system service running Python with FFI, or perhaps JVM etc), enabled
with for example MemoryDenyWriteExecute=trampolines. That way even those
programs would get some protection (though there would be a gap in the
defences).
-Topi