Re: [PATCH 0/4] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

Jeremy Linton <jeremy.linton@xxxxxxx> · Wed, 4 Nov 2020 12:59:45 -0600

Hi,

On 11/4/20 9:20 AM, Mark Rutland wrote:
On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
On 4.11.2020 11.29, Florian Weimer wrote:
* Will Deacon:

Is there real value in this seccomp filter if it only looks at mprotect(),
or was it just implemented because it's easy to do and sounds like a good
idea?

It seems bogus to me.  Everyone will just create alias mappings instead,
just like they did for the similar SELinux feature.  See “Example code
to avoid execmem violations” in:

    <https://www.akkadia.org/drepper/selinux-mem.html>

Also note "But this is very dangerous: programs should never use memory
regions which are writable and executable at the same time. Assuming that it
is really necessary to generate executable code while the program runs the
method employed should be reconsidered."

Sure, and to be clear we're not trying to violate the "at the same time"
property. We do not want to permit simultaneous PROT_WRITE and PROT_EXEC
at any instant in time. What we're asking is to not block changing
permissions to PROT_EXEC in the absence of PROT_WRITE.
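
To make the request concrete, here is a minimal sketch in C of the two stateless checks being contrasted. It assumes a filter that can only see the prot argument of mprotect(), as discussed in this thread. The function names are hypothetical and are not taken from systemd or the kernel, and PROT_BTI is defined locally (with the arm64 value) only so the sketch builds on other architectures.

```c
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value; assumed here so the sketch builds elsewhere */
#endif

/* Hypothetical model of the existing filter as described in this thread:
 * any mprotect() request that includes PROT_EXEC is refused. */
bool current_filter_denies(int prot)
{
    return (prot & PROT_EXEC) != 0;
}

/* Hypothetical model of the relaxation asked for above: refuse only
 * requests that ask for PROT_WRITE and PROT_EXEC at the same time. */
bool relaxed_filter_denies(int prot)
{
    return (prot & PROT_WRITE) != 0 && (prot & PROT_EXEC) != 0;
}
```

Under these assumptions, a request for PROT_READ|PROT_EXEC|PROT_BTI is refused by the first check but passes the second, while PROT_READ|PROT_WRITE|PROT_EXEC is refused by both.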

I think that the goal of preventing WRITE -> EXEC transitions for some
memory is sane, but I think the existing kernel primitives available to
systemd don't allow us to do that in a robust way because we don't have
all the relevant state tracked and accessible, and the existing approach
gets in the way of doing the right thing for other mitigations.

Consequently I think it would be better going forward to add a more
robust (kernel) mechanism for enforcement that can distinguish
WRITE->EXEC from EXEC->EXEC+BTI, and e.g. can be used to forbid aliasing
mappings with differing W/X permissions. Then userspace could eventually
transition over to that and get /stronger/ protection while permitting
the BTI case we'd like to work now.
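
As a rough illustration of why such a mechanism needs state that a seccomp filter does not have, the sketch below models a check that sees both the old and the new protection of a mapping. Everything in it is an assumption made for illustration: struct mapping_state, its was_writable flag, and transition_allowed() are hypothetical and do not correspond to any existing kernel interface.

```c
#include <stdbool.h>
#include <sys/mman.h>

#ifndef PROT_BTI
#define PROT_BTI 0x10 /* arm64 value; assumed here so the sketch builds elsewhere */
#endif

/* Hypothetical per-mapping state that an enforcement mechanism would track. */
struct mapping_state {
    int  prot;          /* current protection bits */
    bool was_writable;  /* sticky: set once PROT_WRITE has ever been granted */
};

/* Returns true if changing the mapping's protection to new_prot is allowed. */
bool transition_allowed(const struct mapping_state *m, int new_prot)
{
    /* Never allow writable and executable at the same time. */
    if ((new_prot & PROT_WRITE) && (new_prot & PROT_EXEC))
        return false;

    /* WRITE -> EXEC: refuse to make executable anything that was ever writable. */
    if ((new_prot & PROT_EXEC) && m->was_writable)
        return false;

    /* Everything else, including EXEC -> EXEC+BTI, is permitted. */
    return true;
}
```

With this model, a never-written text mapping can gain PROT_BTI, while a mapping that has been writable cannot later become executable. Forbidding alias mappings with differing W/X permissions would need further state that this sketch does not attempt to model.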

If a service legitimately needs executable and writable mappings (due to
JIT, trampolines etc), it's easy to disable the filter whenever really
needed with "MemoryDenyWriteExecute=no" (which is the default) in case of
systemd or a TE rule like "allow type_t self:process { execmem };" for
SELinux. But this shouldn't be the default case, since there are many
services which don't need W&X.

I'd also question what is the value of BTI if it can be easily circumvented
by removing PROT_BTI with mprotect()?

I agree that turning BTI off is a concern, and to that end I'd like to
add an enforcement mechanism whereby we could prevent that (ideally the
same mechanism by which we could prevent WRITE -> EXEC transitions).

But, as with all things it's a matter of degree. MDWE and BTI are both
hurdles to an adversary, but neither are absolutes and there are
approaches to bypass either. By the time someone's issuing mprotect()
with an arbitrary VA and/or prot, they are liable to have been able to
do the same with mmap() and circumvent MDWE.

I'd really like to not have BTI silently disabled in order to work with
MDWE, because the risk is that it gets silently disabled elsewhere. The
risk of changing the kernel to enable BTI for a binary is not well
known since we don't control other people's libraries that might end up
not being compatible somehow with that. The risk of disabling a portion
of the MDWE protections seems to be the least out of the options we have
available, as unfortunate as it seems, and I think we can come up with a
better MDWE approach going forward.

OTOH, you don't really want to blanket-disable either protection, and
unfortunately you can't really tell until it's too late whether the service
is fully BTI enabled. So you either end up disabling MDWE unnecessarily,
or you delay until the only choice is not enabling BTI.

I guess there is another option too, which is some kind of delayed MDWE 
policy that only turns on once the service has started, but that isn't 
ideal either.

Thanks,
Mark.