+ Mimi, Dmitry, Integrity, FYI

On Tue, Sep 17, 2024 at 11:54:17AM GMT, Maxwell Bland wrote:
> On Tue, Sep 17, 2024 at 12:34:28AM GMT, Kees Cook wrote:
> > On Thu, Sep 12, 2024 at 04:02:53PM +0000, Maxwell Bland wrote:
> > > operated on around 0.1188 MB). But most importantly, third, without
> > > some degree of provenance, I have no way of telling if someone has
> > > injected malicious code into the kernel, and unfortunately even
> > > knowing the correct bytes is still "iffy": in order to prevent JIT
> > > spray attacks, each of these filters is offset by some random number
> > > of uint32_t's, making every 4-byte shift of the filter a "valid"
> > > codepage to be loaded at runtime.
> >
> > So, let's start here. What I've seen from the thread is that there isn't
> > a way to verify that a given JIT matches the cBPF. Is validating the
> > cBPF itself also needed?
>
> Yes(ish), but mostly no. Current kernel exploits, from what I have seen
> and what is readily available, consist of three stages:
>
> - Find a UAF
> - Bootstrap this UAF into an unconstrained read/write
> - Modify some core kernel resource to get arbitrary execution.
>
> Example dating back to 2019:
> https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html
>
> An adversary could modify the cBPF program prior to loading in order to,
> say, change the range of syscall _NR_'s accepted by the seccomp filter's
> switch statement and use that to stage their escape from Chrome's
> sandbox (a minimal example of such a filter is sketched further below).
>
> However, JIT presents a more general issue, hence the "mostly no": an
> exploited native system service could target the JITed code page in
> order to exploit the kernel, rather than needing to stage something
> inside the modified seccomp sandbox as in the "cBPF itself" example.
>
> For example, Motorola (as well as QCOM) has a few system services for
> hardware and other things, written in C, such as our native dropbox
> agent. Supposing there were an exploit for this agent allowing execution
> within that service's context, an adversary could find a UAF and target
> the page of Chrome's JITed seccomp filter in order to exploit the full
> kernel. That is, they are not worried about escaping the sandbox so much
> as finding a writable resource from which they can gain privileges in
> the rest of the kernel.
>
> Admittedly, there are ~29,000 other writable data structures (in
> msm-kernel) they could also target, but the JIT'ed seccomp filter is the
> only code page they could modify (since it is not possible to get
> compile-time provenance/signatures). The dilemma is that, as opposed to
> modifying, say, the system_unbound_wq and adding an entry to it that
> holds a pointer to call_usermodehelper_exec_work, you could add code to
> this page instead, leaving the kernel just as exploitable.
>
> The goal at the end of the day is to fix this and then try to build a
> system to lock down the rest of the data in a sensible way. Likely an
> ARM-MTE-like, EL2-maintained tag system conditioned on the kernel's
> scheduler and memory allocation infrastructure. At least, that is what I
> want to be working on after I figure out this seccomp stuff.
>
> > - The IMA subsystem has wanted a way to measure (and validate) seccomp
> >   filters. We could get more details from them for defining this need
> >   more clearly.
>
> You are right. I have added Mimi, Dmitry, and the integrity list. Their
> work with linked lists and other data structures is right in line with
> these concerns. I do not know if they have looked at building verifiers
> for JIT'ed cBPF pages already.
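>
> For the integrity folks' context, the filters in question are short
> classic-BPF programs keyed on seccomp_data->nr. As a purely
> illustrative, untested userspace sketch (not one of the filters in this
> thread; __NR_EXAMPLE_LIMIT and install_example_filter are made-up
> names), the syscall-number range check looks roughly like:
>
> #include <stddef.h>
> #include <linux/filter.h>
> #include <linux/seccomp.h>
> #include <sys/prctl.h>
>
> /* Allow syscall numbers below __NR_EXAMPLE_LIMIT, kill everything else.
>  * Real filters also check seccomp_data->arch before trusting nr. */
> #define __NR_EXAMPLE_LIMIT 256
>
> static struct sock_filter filter[] = {
> 	/* A = seccomp_data->nr */
> 	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
> 	/* if (A >= __NR_EXAMPLE_LIMIT) jump to the kill return */
> 	BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, __NR_EXAMPLE_LIMIT, 1, 0),
> 	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
> 	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
> };
>
> static struct sock_fprog prog = {
> 	.len = sizeof(filter) / sizeof(filter[0]),
> 	.filter = filter,
> };
>
> int install_example_filter(void)
> {
> 	/* Required to use SECCOMP_MODE_FILTER without CAP_SYS_ADMIN */
> 	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
> 		return -1;
> 	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
> }
>
> Widening the __NR_EXAMPLE_LIMIT comparison, or the equivalent immediate
> in the JITed copy of it, is all the "modify the cBPF itself" attack
> described above needs.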
>
> > - The JIT needs to be verified against the cBPF that it was generated
> >   from. We currently do only a single pass and don't validate it once
> >   the region has been set read-only. We have a standing feature request
> >   for improving this: https://github.com/KSPP/linux/issues/154
>
> Kees, this is exactly what I'm talking about, you are awesome!
>
> I'll share the (pretty straightforward) EL2 logic for this, though not
> the code, since licensing and all that, but this public mailing list
> should hopefully serve as prior art against any questionable chipset
> vendor attempting to patent public-domain security for the everyday
> person:
>
> - Marking PTEs null is fine.
> - If a new PTE is allocated, mark it PXN atomically, using the EL2
>   permission fault triggered by the page table lockdown (see the
>   GPL-2.0 kernel module below).
> - If a PTE is updated and the PXN bit is switched from 1 to 0, SHA256
>   the page, mark it immutable, and let it through if it is OK.
>
> This lets the page be mucked with during the whole JIT process, but
> ensures that the second the page wants to be priv-executable, no further
> modifications happen. To "unlock" the page for freeing, one just needs
> to set the PXN bit back; if we ever want to execute from it again, the
> process repeats, and so on. This relies on my prior main.c vmalloc
> maintenance and the ptprotect logic below (note: WIP, no warranty on
> this code). A rough EL1-side illustration of the hashing step is
> appended after the module.
>
> > For solutions, I didn't see much discussion around the "orig_prog"
> > copy of the cBPF. Under CHECKPOINT_RESTORE, the original cBPF remains
> > associated with the JIT. struct seccomp_filter's struct bpf_prog prog's
> > orig_prog member. If it has value outside of CHECKPOINT_RESTORE, then
> > we could do it for those conditions too.
>
> Unfortunately, the Android GKI does not support checkpoint/restore,
> which makes the orig_prog reference fail (at least in the cell-phone
> case I am trying to work towards).
>
> I could lock orig_prog as immutable during the JIT and then, given the
> resulting code page, attempt to reproduce that page in EL2 from the
> original cBPF, but that seems dangerous and potentially buggy compared
> to checking the reference addresses in the final machine code against
> knowledge of struct seccomp_data (what I am working on right now).
>
> Maxwell
>
> // SPDX-License-Identifier: GPL-2.0
> /*
>  * Copyright (C) 2023 Motorola Mobility, Inc.
>  *
>  * Authors: Maxwell Bland
>  *          Binsheng "Sammy" Que
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License version 2 as
>  * published by the Free Software Foundation.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>  * GNU General Public License for more details.
>  *
>  * Initializes hypervisor-level protections for the kernel page tables,
>  * in coordination with the moto_org_mem driver, which restricts
>  * executable code pages to a well-defined region between
>  *
>  *   stext <-> module_alloc_base + SZ_2G
>  *
>  * It marks all page tables not corresponding to this virtual address
>  * range as PXNTable and marks the tables those descriptors live in as
>  * immutable.
>  * Tables/descriptors which are marked privileged-executable are marked
>  * permanently immutable, and modifications to them are tracked
>  * directly.
>  */
> #ifndef _PTPROTECT_H
> #define _PTPROTECT_H
>
> #include <linux/delay.h>
> #include <linux/highmem.h>
> #include <linux/kprobes.h>
> #include <linux/list.h>
> #include <linux/mm_types.h>
> #include <linux/module.h>
> #include <linux/of.h>
> #include <linux/of_platform.h>
> #include <linux/pagewalk.h>
> #include <linux/types.h>
> #include <asm/pgalloc.h>
> #include <asm/pgtable-hwdef.h>
> #include <asm/pgtable.h>
> #include <mm/pgalloc-track.h>
> #include <trace/hooks/fault.h>
> #include <trace/hooks/vendor_hooks.h>
> #include <fs/erofs/compress.h>
>
> uint64_t stext_vaddr = 0;
> uint64_t etext_vaddr = 0;
> uint64_t module_alloc_base_vaddr = 0;
>
> uint64_t last_pmd_range[2] = { 0, 0 };
> uint64_t pmd_range_list[1024][2] = { 0 };
> int pmd_range_list_index = 0;
>
> /**
>  * add_to_pmd_range_list - adds a range to the pmd range list
>  * @start: Start of the range
>  * @end: End of the range
>  *
>  * Maintains a naive set of adjacent pmd segments to speed up the
>  * protection code, as otherwise we would treat each pmd (and there are
>  * a lot of them) as a separate region to protect.
>  */
> static void add_to_pmd_range_list(uint64_t start, uint64_t end)
> {
> 	pmd_range_list[pmd_range_list_index][0] = start;
> 	pmd_range_list[pmd_range_list_index][1] = end;
> 	pmd_range_list_index++;
> }
>
> void lock_last_pmd_range(void)
> {
> 	if (last_pmd_range[0] == 0 || last_pmd_range[1] == 0)
> 		return;
> 	split_block(last_pmd_range[0]);
> 	mark_range_ro_smc(last_pmd_range[0], last_pmd_range[1],
> 			  KERN_PROT_PAGE_TABLE);
> 	msleep(10);
> }
>
> /**
>  * prot_pmd_entry - protects a range pointed to by a pmd entry
>  * @pmd: Pointer to the pmd entry
>  * @addr: Virtual address mapped by the pmd entry
>  */
> static void prot_pmd_entry(pmd_t *pmd, unsigned long addr)
> {
> 	uint64_t pgaddr = pmd_page_vaddr(*pmd);
> 	uint64_t start_range = 0;
> 	uint64_t end_range = 0;
>
> 	/*
> 	 * Just found that QCOM's gic_intr_routing.c kernel module is getting
> 	 * allocated at vaddr ffffffdb87f67000, but the module code region
> 	 * should only be allocated from ffffffdb8fc00000 to ffffffdc0fdfffff...
> 	 *
> 	 * It seems to be because arm64's module.h defines module_alloc_base as
> 	 * ((u64)_etext - MODULES_VSIZE). This module_alloc_base should be
> 	 * redefined/randomized by kernel/kaslr.c; however, it appears that
> 	 * early init modules get allocated before module_alloc_base is
> 	 * relocated, so c'est la vie, and the efforts of kaslr.c are for
> 	 * naught (_etext's vaddr is randomized though, so it does not
> 	 * matter, I guess).
> 	 */
> 	uint64_t module_alloc_start = module_alloc_base_vaddr;
> 	uint64_t module_alloc_end = module_alloc_base_vaddr + SZ_2G;
>
> 	if (!pmd_present(*pmd) || pmd_bad(*pmd) || pmd_none(*pmd) ||
> 	    !pmd_val(*pmd))
> 		return;
>
> 	/* Round the starts and ends of each region to their boundary limits */
> 	// module_alloc_start -= (module_alloc_start % PMD_SIZE);
> 	// module_alloc_end += PMD_SIZE - (module_alloc_end % PMD_SIZE) - 1;
>
> 	start_range = __virt_to_phys(pgaddr);
> 	end_range = __virt_to_phys(pgaddr) + sizeof(pte_t) * PTRS_PER_PMD - 1;
>
> 	/* If the PMD potentially points to code, check it in the hypervisor */
> 	if (!pmd_leaf(*pmd) &&
> 	    ((addr <= etext_vaddr && (addr + PMD_SIZE - 1) >= stext_vaddr) ||
> 	     (addr <= module_alloc_end &&
> 	      (addr + PMD_SIZE - 1) >= module_alloc_start))) {
> 		if (start_range == last_pmd_range[1] + 1) {
> 			last_pmd_range[1] = end_range;
> 		} else if (end_range + 1 == last_pmd_range[0]) {
> 			last_pmd_range[0] = start_range;
> 		} else if (last_pmd_range[0] == 0 && last_pmd_range[1] == 0) {
> 			last_pmd_range[0] = start_range;
> 			last_pmd_range[1] = end_range;
> 		} else {
> 			add_to_pmd_range_list(last_pmd_range[0],
> 					      last_pmd_range[1]);
> 			lock_last_pmd_range();
> 			last_pmd_range[0] = start_range;
> 			last_pmd_range[1] = end_range;
> 		}
> 	/* If the PMD points to data only, mark it PXN; the caller will
> 	 * mark the PMD immutable after this function returns */
> 	} else {
> 		if (!pmd_leaf(*pmd)) {
> 			set_pmd(pmd, __pmd(pmd_val(*pmd) | PMD_TABLE_PXN));
> 		} else {
> 			/* TODO: if block, ensure range is marked immutable */
> 			pr_info("MotoRKP: pmd block at %llx\n", start_range);
> 		}
> 	}
> }
>
> pgd_t *swapper_pg_dir_ind;
> void (*set_swapper_pgd_ind)(pgd_t *pgdp, pgd_t pgd);
>
> static inline bool in_swapper_pgdir_ind(void *addr)
> {
> 	return ((unsigned long)addr & PAGE_MASK) ==
> 	       ((unsigned long)swapper_pg_dir_ind & PAGE_MASK);
> }
>
> static inline void set_pgd_ind(pgd_t *pgdp, pgd_t pgd)
> {
> 	if (in_swapper_pgdir_ind(pgdp)) {
> 		set_swapper_pgd_ind(pgdp, __pgd(pgd_val(pgd)));
> 		return;
> 	}
>
> 	WRITE_ONCE(*pgdp, pgd);
> 	dsb(ishst);
> 	isb();
> }
>
> /**
>  * prot_pgd_entry - protects a range pointed to by a pgd entry
>  * @pgd: pgd struct with descriptor values
>  * @addr: vaddr of the start of the pgd's referenced memory range
>  * @next: end vaddr of the walk range
>  * @walk: page table walk state (unused)
>  */
> static int prot_pgd_entry(pgd_t *pgd, unsigned long addr, unsigned long next,
> 			  struct mm_walk *walk)
> {
> 	uint64_t pgaddr = pgd_page_vaddr(*pgd);
> 	uint64_t start_range = 0;
> 	uint64_t end_range = 0;
> 	uint64_t module_alloc_start = module_alloc_base_vaddr;
> 	uint64_t module_alloc_end = module_alloc_base_vaddr + SZ_2G;
> 	uint64_t i = 0;
> 	pmd_t *subdescriptor = NULL;
> 	unsigned long subdescriptor_addr = addr;
>
> 	if (!pgd_present(*pgd) || pgd_bad(*pgd) || pgd_none(*pgd) ||
> 	    !pgd_val(*pgd))
> 		return 0;
>
> 	/* Round the starts and ends of each region to their boundary limits */
> 	// module_alloc_start -= (module_alloc_start % PGDIR_SIZE);
> 	// module_alloc_end += PGDIR_SIZE - (module_alloc_end % PGDIR_SIZE) - 1;
>
> 	if (!pgd_leaf(*pgd)) {
> 		start_range = __virt_to_phys(pgaddr);
> 		end_range = __virt_to_phys(pgaddr) +
> 			    sizeof(p4d_t) * PTRS_PER_PGD - 1;
>
> 		/* If the PGD covers addresses between stext_vaddr and
> 		 * etext_vaddr or between module_alloc_base and
> 		 * module_alloc_base + SZ_2G, do not mark it PXN */
> 		if ((addr <= etext_vaddr &&
> 		     (addr + PGDIR_SIZE - 1) >= stext_vaddr) ||
> 		    (addr <= module_alloc_end &&
> 		     (addr + PGDIR_SIZE - 1) >= module_alloc_start)) {
> 			/* Protect all second-level PMD entries */
> 			for (i = 0; i < PTRS_PER_PGD; i++) {
> 				subdescriptor = (pmd_t *)(pgaddr +
> 						i * sizeof(pmd_t));
> 				prot_pmd_entry(subdescriptor,
> 					       subdescriptor_addr);
> 				subdescriptor_addr += PMD_SIZE;
> 			}
> 			lock_last_pmd_range();
>
> 			split_block(start_range);
> 			mark_range_ro_smc(start_range, end_range,
> 					  KERN_PROT_PAGE_TABLE);
> 		} else {
> 			/* Further modifications are protected by immutability
> 			 * from hyp_rodata_end to __inittext_begin in kickoff.
> 			 * Bit 59 is the PXNTable bit of a table descriptor. */
> 			set_pgd_ind(pgd, __pgd(pgd_val(*pgd) | 1UL << 59));
> 		}
> 	} else {
> 		/* TODO: Handle block case at this level? */
> 		pr_info("MotoRKP: pgd block at %llx\n", start_range);
> 	}
> 	return 0;
> }
>
> /*
>  * Locks down the ranges of memory pointed to by all PGDs as read-only.
>  * Current kernel configurations do not bother with p4ds or puds, and
>  * thus we do not need protections for these layers (the pgd points
>  * directly to the pmd).
>  */
> static const struct mm_walk_ops protect_pgds = {
> 	.pgd_entry = prot_pgd_entry,
> };
>
> #endif /* _PTPROTECT_H */
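>
> For illustration of the SHA256 step in the list above, an EL1-side
> sketch using the kernel crypto API would look roughly like the
> following (minimal and untested, not the EL2 implementation;
> sha256_exec_page is a made-up name):
>
> #include <crypto/hash.h>
> #include <crypto/sha2.h> /* crypto/sha.h on older kernels */
> #include <linux/err.h>
> #include <linux/mm.h>
> #include <linux/slab.h>
>
> /*
>  * sha256_exec_page - hash one page whose PXN bit is about to be cleared
>  * @vaddr: kernel virtual address of the page
>  * @digest: out buffer of SHA256_DIGEST_SIZE bytes
>  *
>  * The resulting digest would be compared against an allowed measurement
>  * before the page is marked immutable and executable.
>  */
> static int sha256_exec_page(const void *vaddr, u8 digest[SHA256_DIGEST_SIZE])
> {
> 	struct crypto_shash *tfm;
> 	struct shash_desc *desc;
> 	int ret;
>
> 	tfm = crypto_alloc_shash("sha256", 0, 0);
> 	if (IS_ERR(tfm))
> 		return PTR_ERR(tfm);
>
> 	desc = kmalloc(sizeof(*desc) + crypto_shash_descsize(tfm), GFP_KERNEL);
> 	if (!desc) {
> 		crypto_free_shash(tfm);
> 		return -ENOMEM;
> 	}
> 	desc->tfm = tfm;
>
> 	ret = crypto_shash_digest(desc, vaddr, PAGE_SIZE, digest);
>
> 	kfree(desc);
> 	crypto_free_shash(tfm);
> 	return ret;
> }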