On Thu, Sep 16, 2021 at 04:13:37PM +0100, Will Deacon wrote:
> Hi Arnd,
>
> On Thu, Sep 16, 2021 at 04:46:15PM +0200, Arnd Bergmann wrote:
> > On Thu, Sep 16, 2021 at 3:18 PM Will Deacon <will@xxxxxxxxxx> wrote:
> > >
> > > Distributions such as Android which support a mixture of 32-bit (compat)
> > > and 64-bit (native) tasks necessarily ship with the compat ELF loader
> > > enabled in their kernels. However, as time goes by, an ever-increasing
> > > proportion of userspace consists of native applications and in some cases
> > > 32-bit capabilities are starting to be removed from the CPUs altogether.
> > >
> > > Inevitably, this means that the compat code becomes somewhat of a
> > > maintenance burden, receiving less testing coverage and exposing an
> > > additional kernel attack surface to userspace during the lengthy
> > > transitional period where some shipping devices require support for
> > > 32-bit binaries.
> > >
> > > Introduce a new sysctl 'fs.compat-binfmt-elf-enable' to allow the compat
> > > ELF loader to be disabled dynamically on devices where it is not required.
> > > On arm64, this is sufficient to prevent userspace from executing 32-bit
> > > code at all.
> > >
> > > Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> > > Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> > > Cc: Arnd Bergmann <arnd@xxxxxxxx>
> > > Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
> > > Cc: Kees Cook <keescook@xxxxxxxxxxxx>
> > > Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> > > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > Signed-off-by: Will Deacon <will@xxxxxxxxxx>
> > > ---
> > >  fs/compat_binfmt_elf.c | 24 +++++++++++++++++++++++-
> > >  1 file changed, 23 insertions(+), 1 deletion(-)
> > >
> > > I started off hacking this into the arch code, but then I realised it was
> > > just as easy doing it in the core for everybody to enjoy.
> > > Unfortunately, after talking to Peter, it sounds like it doesn't really
> > > help on x86 where userspace can switch to 32-bit without involving the
> > > kernel at all.
> > >
> > > Thoughts?
> >
> > I'm not sure I understand the logic behind the sysctl. Are you worried
> > about exposing attack surface on devices that don't support 32-bit
> > instructions at all but might be tricked into loading a 32-bit binary that
> > exploits a bug in the elf loader, or do you want to remove compat support
> > on some but not all devices running the same kernel?
>
> It's the latter case. With the GKI effort in Android, we want to run the
> same kernel binary across multiple devices. However, for some devices
> we may be able to determine that there is no need to support 32-bit
> applications even though the hardware may support them, and we would
> like to ensure that things like the compat syscall wrappers, compat vDSO,
> signal handling etc are not accessible to applications.

I like the idea! I wonder if the binfmts should have an "enabled" flag
instead? This would make it not compat_binfmt_elf-specific, and would
avoid a new "special" sysctl flag:

	static bool enabled = 1;
	module_param(enabled, bool, 0600);
	MODULE_PARM_DESC(enabled, "Whether this binfmt is available for loading");

Then:

	echo 0 > /sys/module/compat_binfmt_elf/parameters/enabled

> > In the first case, having the kernel make the decision based on CPU
> > feature flags would be easier. In the second case, I would expect this
> > to be a per-process setting similar to prctl, capability or seccomp.
> > This would make it possible to do it for separately per container
> > and avoid ambiguity about what happens to already-running 32-bit
> > tasks.
>
> I'm not sure I follow the per-process aspect of your suggestion -- we want
> to prevent 32-bit tasks from existing at all.
> If it wasn't for GKI, we'd just disable CONFIG_COMPAT altogether, but
> while there is a need for 32-bit support on some devices then we're not
> able to do that.

It's possible to do process-hierarchy-controlled compat-restriction on all
architectures with a seccomp ARCH test. For example:

	BPF_STMT(BPF_LD+BPF_W+BPF_ABS, arch_nr),
	BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL_PROCESS),
	BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)

This filter will have a small, fixed overhead because of the automatic
seccomp bitmaps. FWIW, systemd exposes this feature via
"SystemCallArchitectures=native".

This doesn't stop the loader attack surface, though, so I think something
to control that makes sense.

--
Kees Cook
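For reference, the seccomp ARCH test above could be wrapped up roughly as follows. This is only a sketch under some assumptions: the helper names build_arch_filter() and install_arch_filter() and the native_arch parameter are hypothetical, and arch_nr is taken to mean offsetof(struct seccomp_data, arch).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

#define ARCH_FILTER_LEN 4

/*
 * Fill prog[] with a filter that kills the process on any syscall whose
 * seccomp_data.arch differs from native_arch (e.g. AUDIT_ARCH_AARCH64),
 * and allows everything else.
 */
void build_arch_filter(struct sock_filter prog[ARCH_FILTER_LEN],
		       uint32_t native_arch)
{
	const struct sock_filter insns[ARCH_FILTER_LEN] = {
		/* Load the arch field of struct seccomp_data. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, arch)),
		/* Native arch: skip the kill and fall through to allow. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, native_arch, 1, 0),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};

	memcpy(prog, insns, sizeof(insns));
}

/*
 * Install the filter on the calling task. Returns 0 on success, -1 on
 * failure with errno set by prctl().
 */
int install_arch_filter(uint32_t native_arch)
{
	struct sock_filter insns[ARCH_FILTER_LEN];
	struct sock_fprog prog = {
		.len = ARCH_FILTER_LEN,
		.filter = insns,
	};

	build_arch_filter(insns, native_arch);

	/* Needed so an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;

	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

A caller on arm64 would pass AUDIT_ARCH_AARCH64. The filter is inherited across fork() and preserved across execve(), which is what makes the restriction apply to a whole process hierarchy.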