On 22.10.2020 10.54, Szabolcs Nagy wrote:
The 10/21/2020 22:44, Jeremy Linton wrote:
There is a problem with glibc+systemd on BTI enabled systems. Systemd
has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
being BTI compatible, and it does so by calling mprotect with
PROT_EXEC|PROT_BTI. That call is caught by the seccomp filter, resulting
in service failures.
So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
This is obviously not desirable.
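For reference, a minimal sketch in C of the failure mode described above. It is not systemd's actual filter (systemd builds its filter with libseccomp); the function names are made up, the EPERM return value is an assumption, the architecture check a real filter needs is left out, and the argument load assumes a little-endian system. PROT_BTI is omitted so the sketch also runs on non-arm64 machines; on arm64 it would just be one more bit in the same mprotect call.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: deny every mprotect() whose prot has PROT_EXEC set.
 * Simplified: no AUDIT_ARCH check, little-endian argument layout assumed. */
int install_deny_exec_mprotect(void)
{
    struct sock_filter prog[] = {
        /* A = syscall number */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* not mprotect: jump to the final ALLOW */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_mprotect, 0, 3),
        /* A = low 32 bits of the third argument (prot) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, args[2])),
        /* PROT_EXEC set: fall through to ERRNO, otherwise skip to ALLOW */
        BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, PROT_EXEC, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog fprog = {
        .len = sizeof(prog) / sizeof(prog[0]),
        .filter = prog,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
}

/* Runs in a child process so the caller stays unfiltered. Returns the errno
 * seen by mprotect() on an already executable mapping, 0 if the call
 * succeeded, 255 if the setup failed, or -1 if the child could not be run. */
int mprotect_errno_under_filter(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        long page = sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, page, PROT_READ | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED || install_deny_exec_mprotect() != 0)
            _exit(255);
        /* The mapping is already PROT_EXEC, yet the filter cannot know that. */
        int rc = mprotect(p, page, PROT_READ | PROT_EXEC);
        _exit(rc == 0 ? 0 : errno);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The point of the sketch is that the filter only sees the syscall arguments, so it rejects the call even though the mapping gains no new execute permission.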
Various changes have been suggested: replacing the mprotect with mmap calls
having PROT_BTI set on the original mapping, re-mmapping the segments,
implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
and various modifications to seccomp to allow particular mprotect cases to
bypass the filters. In each case there seems to be an undesirable attribute
to the solution.
So, what's the best solution?
the easy fix in glibc is to ignore mprotect(PROT_BTI|PROT_EXEC)
failures, so programs work with seccomp filters, but bti gets
disabled (it's unreasonable to expect bti protection if mprotect
is filtered). it will be a nasty silent failure though.
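A rough sketch in C of what "ignore the failure" could look like. This is not glibc's code; try_enable_bti is a hypothetical helper, and the PROT_BTI fallback definition uses the arm64 value from the Linux UAPI headers for systems whose libc headers do not define it.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value; assumed here when headers lack it */
#endif

/* Hypothetical helper, not glibc's implementation.
 * Returns 1 if BTI was enabled on the range, 0 if the request was refused
 * and the range was left as it was (still executable, just without BTI). */
int try_enable_bti(void *start, size_t len)
{
    if (mprotect(start, len, PROT_READ | PROT_EXEC | PROT_BTI) == 0)
        return 1;

    /* Refused: for example by a seccomp filter returning an errno, or by a
     * kernel or CPU that does not know PROT_BTI. The failure is swallowed,
     * which is exactly the silent downgrade mentioned above. */
    return 0;
}
```

A caller that wants to avoid the silent failure could log when the return value is 0. A filter that kills the process instead of returning an errno never gets this far.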
Some may also want to use seccomp filters that immediately kill the
process on a violation, and in that case ignoring the failure wouldn't
work.
and i'm also considering a fix that re-mmaps the executable
segment with PROT_BTI instead of mprotect since that is not
filtered. unfortunately the main exe is mmaped by the kernel
without PROT_BTI and the libc does not have the fd to re-mmap.
(bti can be left off for the main exe if mprotect fails and
later we can teach the kernel to add bti there.) currently
this is not a complete fix so i'm a bit hesitant about it.
as for a kernel side fix: if there is a way to only filter
PROT_EXEC mprotect on mappings that are not yet PROT_EXEC
that would solve this problem (but likely needs new syscall
or seccomp capability).
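The rule described here can be modelled as a small predicate in C. This is only a model of the proposed policy, not an existing kernel or seccomp interface; deny_mprotect is a hypothetical name, and the current protection of the mapping is passed in explicitly because a seccomp filter today only sees the syscall arguments.

```c
#include <stdbool.h>
#include <sys/mman.h>

/* Hypothetical policy model: deny only when the call would make a mapping
 * executable that is not executable yet. */
bool deny_mprotect(int current_prot, int requested_prot)
{
    bool wants_exec = (requested_prot & PROT_EXEC) != 0;
    bool already_exec = (current_prot & PROT_EXEC) != 0;

    return wants_exec && !already_exec;
}
```

Under this model, adding PROT_BTI to a mapping that is already PROT_EXEC passes, while turning a writable data mapping into an executable one is still refused.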
The problem with seccomp MDWX is that it's still possible for malicious
programs to circumvent the filter: create a file with memfd_create(),
fill it with the desired content and then use mmap(,,PROT_EXEC) to make
it executable without triggering seccomp. This can be mitigated by also
filtering memfd_create(), but then some programs want to use it.
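To make the sequence concrete, a short C sketch. map_executable_copy is a made-up name, and the sketch assumes a glibc that provides memfd_create() (2.27 or later). It only maps the bytes and never jumps to them.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns a read+execute mapping holding a copy of
 * buf, or NULL on failure. No mprotect() call and no mapping that is
 * writable and executable at the same time is involved. */
void *map_executable_copy(const void *buf, size_t len)
{
    int fd = memfd_create("demo", MFD_CLOEXEC);
    if (fd < 0)
        return NULL;

    /* The content is written through the fd, not through a mapping. */
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return NULL;
    }

    /* PROT_EXEC without PROT_WRITE on a fresh mmap(). */
    void *p = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}
```

Since the write goes through write() and the mapping is never writable, a filter that only looks at the prot argument of mmap() and mprotect() has nothing to object to.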
Also the protection can be bypassed if the program can write to a file
system which isn't mounted with "noexec". This can be mitigated with
private mount namespaces and global mount options, but again some
programs are written to expect W & X.
But I think SELinux has a more complete solution (execmem) which can
track the pages better than is possible with the seccomp solution, which
has a very narrow field of view. Maybe this facility could be made
available to non-SELinux systems, for example with prctl()? Then the
in-kernel MDWX could allow mprotect(PROT_EXEC | PROT_BTI) in case the
backing file hasn't been modified, the source filesystem isn't writable
for the calling process and the file descriptor wasn't created with
memfd_create().
-Topi