On Tue, Sep 17, 2024 at 12:34:28AM GMT, Kees Cook wrote:
> On Thu, Sep 12, 2024 at 04:02:53PM +0000, Maxwell Bland wrote:
> > operated on around 0.1188 MB). But most importantly, third, without
> > some degree of provenance, I have no way of telling if someone has
> > injected malicious code into the kernel, and unfortunately even
> > knowing the correct bytes is still "iffy", as in order to prevent JIT
> > spray attacks, each of these filters is offset by some random number
> > of uint32_t's, making every 4-byte shift of the filter a "valid"
> > codepage to be loaded at runtime.
>
> So, let's start here. What I've seen from the thread is that there isn't
> a way to verify that a given JIT matches the cBPF. Is validating the
> cBPF itself also needed?

Yes(ish), but mostly no. Current kernel exploits, from what I have seen
and what is readily available, consist of three stages:

- Find a UAF
- Bootstrap this UAF into an unconstrained read/write
- Modify some core kernel resource to get arbitrary execution

Example dating back to 2019:
https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html

An adversary could modify the cBPF program prior to loading in order to,
say, change the range of syscall _NR_'s accepted by the seccomp switch
statement, in order to stage their escape from Chrome's sandbox.

However, JIT presents a more general issue, hence the "mostly no": an
exploited native system service could target the JITed code page in
order to exploit the kernel, rather than requiring something to be
staged within the modified seccomp sandbox as in the "cBPF itself"
example. For example, Motorola (as does QCOM) ships a few native system
services for hardware and other things, written in C, e.g. our native
dropbox agent. Supposing there were an exploit for this agent allowing
execution within that service's context, an adversary could find a UAF
and target the page of Chrome's JITed seccomp filter in order to exploit
the full kernel. That is, they are not worried about escaping the
sandbox so much as finding a writable resource from which they can gain
privileges in the rest of the kernel.

Admittedly, there are ~29,000 other writable data structures (in
msm-kernel) they could also target, but the JIT'ed seccomp filter is the
only code page they could modify (since it is not possible to get
compile-time provenance/signatures for it).
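To make the provenance problem quoted at the top concrete: because the
JIT image can legitimately start at any 4-byte-aligned offset within its
pages, a verifier that knows the expected instruction bytes still cannot
do a whole-page comparison; the best it can do is scan for the image at
every aligned shift. A minimal, purely illustrative sketch (hypothetical
helper, not kernel code, untested):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: search a JIT code region for the expected filter
 * image at every 4-byte-aligned offset. "expected" would have to come
 * from somewhere trustworthy, which is the whole provenance problem.
 * Returns the byte offset of the match, or -1 if none is found.
 */
static long find_filter_image(const uint8_t *region, size_t region_len,
                              const uint8_t *expected, size_t expected_len)
{
        size_t off;

        if (expected_len == 0 || expected_len > region_len)
                return -1;

        for (off = 0; off + expected_len <= region_len;
             off += sizeof(uint32_t)) {
                if (memcmp(region + off, expected, expected_len) == 0)
                        return (long)off;
        }
        return -1;
}

Every offset this scan would accept is a "valid" placement as far as the
page contents go, which is exactly why knowing the correct bytes alone is
still iffy.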
The dilemma is that, as opposed to modifying, say, the system_unbound_wq
and adding an entry to it that holds a pointer to
call_usermodehelper_exec_work, you could add some code to this page
instead, making the kernel just as exploitable.

The goal at the end of the day is to fix this and then try to build a
system to lock down the rest of the data in a sensible way. Likely an
ARM-MTE-like, EL2-maintained tag system conditioned on the kernel's
scheduler and memory allocation infrastructure. At least, that is what I
want to be working on, after I figure out this seccomp stuff.

> - The IMA subsystem has wanted a way to measure (and validate) seccomp
> filters. We could get more details from them for defining this need
> more clearly.

You are right. I have added Mimi, Dmitry, and the integrity list. Their
work with linked lists and other data structures is right in line with
these concerns. I do not know if they have looked at building verifiers
for JIT'ed cBPF pages already.

> - The JIT needs to be verified against the cBPF that it was generated
> from. We currently do only a single pass and don't validate it once
> the region has been set read-only. We have a standing feature request
> for improving this: https://github.com/KSPP/linux/issues/154

Kees, this is exactly what I'm talking about, you are awesome! I'll
share the (pretty straightforward) EL2 logic for this, though not the
code, since licensing and all that, but this public mailing list should
hopefully serve as prior art against any questionable chipset vendor
attempting to patent public-domain security for the everyday person:

- Marking PTEs null is fine.
- If a new PTE is allocated, mark it PXN atomically, using the EL2
  permission fault triggered by the page table lockdown (see the
  GPL-2.0 kernel module below).
- If a PTE is updated and the PXN bit is switched from 1 to 0, SHA256
  the page, mark it immutable, and let it through if it is OK.

This lets the page be mucked with during the whole JIT process, but
ensures that the second the page wants to be priv-executable, no further
modifications happen. To "unlock" the page for freeing, one just needs
to set the PXN bit back. Then if we ever want to execute from it again,
the process repeats, and so on (a rough sketch of this check is below).
All of this relies on my prior main.c vmalloc maintenance and the
ptprotect logic at the end of this mail (note: WIP, no warranty on that
code).
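The sketch: this is hedged pseudo-kernel-C only, not the actual
hypervisor code; sha256_page_is_expected(), mark_page_immutable(), and
mark_page_mutable() are placeholders for the real EL2 primitives, which
I am not posting.

#include <stdbool.h>
#include <stdint.h>

/* Placeholders for the real EL2 primitives (not posted here). */
extern bool sha256_page_is_expected(const void *page);
extern void mark_page_immutable(const void *page);
extern void mark_page_mutable(const void *page);

#define PTE_PXN_BIT (1ULL << 53) /* PXN bit in an arm64 stage-1 leaf descriptor */

/*
 * Sketch of the PXN state machine, run on every trapped PTE write.
 * Returns true if the update should be allowed to go through.
 */
static bool el2_check_pte_update(uint64_t old_pte, uint64_t new_pte,
                                 const void *page)
{
        /* PXN 1 -> 0: the page wants to become privileged-executable. */
        if ((old_pte & PTE_PXN_BIT) && !(new_pte & PTE_PXN_BIT)) {
                if (!sha256_page_is_expected(page))
                        return false;           /* reject the transition */
                mark_page_immutable(page);      /* freeze it before it runs */
                return true;
        }

        /* PXN 0 -> 1: the page is being "unlocked" for freeing. */
        if (!(old_pte & PTE_PXN_BIT) && (new_pte & PTE_PXN_BIT))
                mark_page_mutable(page);

        /* Everything else (including zeroing the PTE) is fine. */
        return true;
}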
> For solutions, I didn't see much discussion around the "orig_prog"
> copy of the cBPF. Under CHECKPOINT_RESTORE, the original cBPF remains
> associated with the JIT. struct seccomp_filter's struct bpf_prog prog's
> orig_prog member. If it has value outside of CHECKPOINT_RESTORE, then
> we could do it for those conditions too.

Unfortunately the Android GKI does not support checkpoint/restore, which
makes the orig_prog reference fail (at least in the case I'm trying to
work towards for cell phones). I could lock the orig_prog as immutable
during the JIT and then, given the resulting code page, attempt to
reproduce that page in EL2 from the original cBPF, but that seems
dangerous and potentially buggy as opposed to checking the reference
addresses in the final machine code against knowledge of struct
seccomp_data (what I am working on right now).

Maxwell

// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 2023 Motorola Mobility, Inc.
 *
 * Authors: Maxwell Bland
 *          Binsheng "Sammy" Que
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License version 2 as
 * published by the Free Software Foundation.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * Initializes hypervisor-level protections for the kernel pagetables, in
 * coordination with the moto_org_mem driver, which restricts executable
 * code pages to a well-defined region in between
 *
 *     stext <-> module_alloc_base + SZ_2G
 *
 * It marks all page tables not corresponding to this virtual address
 * range PXNTable and marks the tables these descriptors live in as
 * immutable. All tables/descriptors which are marked privileged
 * executable are marked permanently immutable, and modifications to
 * them are tracked directly.
 */
#ifndef _PTPROTECT_H
#define _PTPROTECT_H

#include <linux/delay.h>
#include <linux/highmem.h>
#include <linux/kprobes.h>
#include <linux/list.h>
#include <linux/mm_types.h>
#include <linux/module.h>
#include <linux/of.h>
#include <linux/of_platform.h>
#include <linux/pagewalk.h>
#include <linux/types.h>
#include <asm/pgalloc.h>
#include <asm/pgtable-hwdef.h>
#include <asm/pgtable.h>
#include <mm/pgalloc-track.h>
#include <trace/hooks/fault.h>
#include <trace/hooks/vendor_hooks.h>
#include <fs/erofs/compress.h>

uint64_t stext_vaddr = 0;
uint64_t etext_vaddr = 0;
uint64_t module_alloc_base_vaddr = 0;

uint64_t last_pmd_range[2] = { 0, 0 };
uint64_t pmd_range_list[1024][2] = { 0 };
int pmd_range_list_index = 0;

/**
 * add_to_pmd_range_list - adds a range to the pmd range list
 * @start: Start of the range
 * @end: End of the range
 *
 * Used to implement a naive set of adjacent pmd segments to speed up the
 * protection code, as otherwise we would treat each pmd (and there are a
 * lot of them) as a separate region to protect.
 */
static void add_to_pmd_range_list(uint64_t start, uint64_t end)
{
        /* Drop further ranges rather than writing past the array. */
        if (pmd_range_list_index >= ARRAY_SIZE(pmd_range_list))
                return;

        pmd_range_list[pmd_range_list_index][0] = start;
        pmd_range_list[pmd_range_list_index][1] = end;
        pmd_range_list_index++;
}

void lock_last_pmd_range(void)
{
        if (last_pmd_range[0] == 0 || last_pmd_range[1] == 0)
                return;
        split_block(last_pmd_range[0]);
        mark_range_ro_smc(last_pmd_range[0], last_pmd_range[1],
                          KERN_PROT_PAGE_TABLE);
        msleep(10);
}

/**
 * prot_pmd_entry - protects a range pointed to by a pmd entry
 * @pmd: Pointer to the pmd entry
 * @addr: Start of the virtual address range mapped by @pmd
 */
static void prot_pmd_entry(pmd_t *pmd, unsigned long addr)
{
        uint64_t pgaddr = pmd_page_vaddr(*pmd);
        uint64_t start_range = 0;
        uint64_t end_range = 0;
        /*
         * Just found that QCOM's gic_intr_routing.c kernel module is getting
         * allocated at vaddr ffffffdb87f67000, but the modules code region
         * should only be allocated from ffffffdb8fc00000 to ffffffdc0fdfffff...
         *
         * It seems to be because arm64's module.h defines module_alloc_base as
         * ((u64)_etext - MODULES_VSIZE). This module_alloc_base preprocessor
         * define should be redefined/randomized by kernel/kaslr.c; however, it
         * appears that early init modules get allocated before
         * module_alloc_base is relocated, so c'est la vie, and the efforts of
         * kaslr.c are for naught (_etext's vaddr is randomized though, so it
         * does not matter, I guess).
         */
        uint64_t module_alloc_start = module_alloc_base_vaddr;
        uint64_t module_alloc_end = module_alloc_base_vaddr + SZ_2G;

        if (!pmd_present(*pmd) || pmd_bad(*pmd) || pmd_none(*pmd) ||
            !pmd_val(*pmd))
                return;

        /* Round the starts and ends of each region to their boundary limits */
        // module_alloc_start -= (module_alloc_start % PMD_SIZE);
        // module_alloc_end += PMD_SIZE - (module_alloc_end % PMD_SIZE) - 1;

        start_range = __virt_to_phys(pgaddr);
        end_range = __virt_to_phys(pgaddr) + sizeof(pte_t) * PTRS_PER_PMD - 1;

        /* If the PMD potentially points to code, check it in the hypervisor */
        if (!pmd_leaf(*pmd) &&
            ((addr <= etext_vaddr && (addr + PMD_SIZE - 1) >= stext_vaddr) ||
             (addr <= module_alloc_end &&
              (addr + PMD_SIZE - 1) >= module_alloc_start))) {
                if (start_range == last_pmd_range[1] + 1) {
                        last_pmd_range[1] = end_range;
                } else if (end_range + 1 == last_pmd_range[0]) {
                        last_pmd_range[0] = start_range;
                } else if (last_pmd_range[0] == 0 && last_pmd_range[1] == 0) {
                        last_pmd_range[0] = start_range;
                        last_pmd_range[1] = end_range;
                } else {
                        add_to_pmd_range_list(last_pmd_range[0],
                                              last_pmd_range[1]);
                        lock_last_pmd_range();
                        last_pmd_range[0] = start_range;
                        last_pmd_range[1] = end_range;
                }
        /* If the PMD points to data only, mark it PXN, as the caller will
         * mark the PMD immutable after this function returns */
        } else {
                if (!pmd_leaf(*pmd)) {
                        set_pmd(pmd, __pmd(pmd_val(*pmd) | PMD_TABLE_PXN));
                } else {
                        /* TODO: if block, ensure range is marked immutable */
                        pr_info("MotoRKP: pmd block at %llx\n", start_range);
                }
        }
}

pgd_t *swapper_pg_dir_ind;
void (*set_swapper_pgd_ind)(pgd_t *pgdp, pgd_t pgd);

static inline bool in_swapper_pgdir_ind(void *addr)
{
        return ((unsigned long)addr & PAGE_MASK) ==
               ((unsigned long)swapper_pg_dir_ind & PAGE_MASK);
}

static inline void set_pgd_ind(pgd_t *pgdp, pgd_t pgd)
{
        if (in_swapper_pgdir_ind(pgdp)) {
                set_swapper_pgd_ind(pgdp, __pgd(pgd_val(pgd)));
                return;
        }

        WRITE_ONCE(*pgdp, pgd);
        dsb(ishst);
        isb();
}

/**
 * prot_pgd_entry - protects a range pointed to by a pgd entry
 * @pgd: pgd struct with descriptor values
 * @addr: vaddr of the start of the pgd's referenced memory range
 */
static int prot_pgd_entry(pgd_t *pgd, unsigned long addr, unsigned long next,
                          struct mm_walk *walk)
{
        uint64_t pgaddr = pgd_page_vaddr(*pgd);
        uint64_t start_range = 0;
        uint64_t end_range = 0;
        uint64_t module_alloc_start = module_alloc_base_vaddr;
        uint64_t module_alloc_end = module_alloc_base_vaddr + SZ_2G;
        uint64_t i = 0;
        pmd_t *subdescriptor = 0;
        unsigned long subdescriptor_addr = addr;

        if (!pgd_present(*pgd) || pgd_bad(*pgd) || pgd_none(*pgd) ||
            !pgd_val(*pgd))
                return 0;

        /* Round the starts and ends of each region to their boundary limits */
        // module_alloc_start -= (module_alloc_start % PGDIR_SIZE);
        // module_alloc_end += PGDIR_SIZE - (module_alloc_end % PGDIR_SIZE) - 1;

        if (!pgd_leaf(*pgd)) {
                start_range = __virt_to_phys(pgaddr);
                end_range = __virt_to_phys(pgaddr) +
                            sizeof(p4d_t) * PTRS_PER_PGD - 1;
                /* If the PGD contains addresses between stext_vaddr and
                 * etext_vaddr or module_alloc_base and module_alloc_base +
                 * SZ_2G, then do not mark it PXN */
                if ((addr <= etext_vaddr &&
                     (addr + PGDIR_SIZE - 1) >= stext_vaddr) ||
                    (addr <= module_alloc_end &&
                     (addr + PGDIR_SIZE - 1) >= module_alloc_start)) {
                        /* Protect all second-level PMD entries */
                        for (i = 0; i < PTRS_PER_PGD; i++) {
                                subdescriptor = (pmd_t *)(pgaddr +
                                                          i * sizeof(pmd_t));
                                prot_pmd_entry(subdescriptor,
                                               subdescriptor_addr);
                                subdescriptor_addr += PMD_SIZE;
                        }
                        lock_last_pmd_range();
                        split_block(start_range);
                        mark_range_ro_smc(start_range, end_range,
                                          KERN_PROT_PAGE_TABLE);
                } else {
                        /* Further modifications protected by immutability
                         * from hyp_rodata_end to __inittext_begin in kickoff.
                         * Bit 59 of a table descriptor is PXNTable. */
                        set_pgd_ind(pgd, __pgd(pgd_val(*pgd) | 1UL << 59));
                }
        } else {
                /* TODO: Handle block case at this level? start_range is not
                 * set for a leaf pgd, so log the virtual address instead. */
                pr_info("MotoRKP: pgd block at vaddr %lx\n", addr);
        }

        return 0;
}

/*
 * Locks down the ranges of memory pointed to by all PGDs as read-only.
 * Current kernel configurations do not bother with p4ds or puds, and
 * thus we do not need protections for these layers (the pgd points
 * directly to the pmd).
 */
static const struct mm_walk_ops protect_pgds = {
        .pgd_entry = prot_pgd_entry,
};

#endif /* _PTPROTECT_H */
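For completeness, the walk gets kicked off from the module's init path
(the real init, symbol lookups, and EL2 SMC setup live in main.c and are
not included here). Roughly, as a sketch only: it assumes init_mm, the
mmap lock helpers, and walk_page_range_novma are reachable from the
module (in practice that takes symbol-lookup gymnastics I am eliding),
and that stext_vaddr, etext_vaddr, and module_alloc_base_vaddr have
already been filled in; moto_rkp_ptprotect_init is a placeholder name:

static int __init moto_rkp_ptprotect_init(void)
{
        int ret;

        /*
         * Walk the kernel page tables (init_mm/swapper_pg_dir) with the
         * protect_pgds ops above. Bounds are illustrative: they just
         * need to cover the kernel half of the address space.
         */
        mmap_write_lock(&init_mm);
        ret = walk_page_range_novma(&init_mm, PAGE_OFFSET, ~0UL,
                                    &protect_pgds, NULL, NULL);
        mmap_write_unlock(&init_mm);

        return ret;
}

Since protect_pgds only defines .pgd_entry, the generic walker does not
descend into lower levels on its own; prot_pgd_entry walks the
second-level entries itself, which is the behavior we want here.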