+ Mimi, Dmitry, Integrity, FYI

On Tue, Sep 17, 2024 at 11:54:17AM GMT, Maxwell Bland wrote:
> On Tue, Sep 17, 2024 at 12:34:28AM GMT, Kees Cook wrote:
> > On Thu, Sep 12, 2024 at 04:02:53PM +0000, Maxwell Bland wrote:
> > > operated on around 0.1188 MB). But most importantly, third, without
> > > some degree of provenance, I have no way of telling if someone has
> > > injected malicious code into the kernel, and unfortunately even
> > > knowing the correct bytes is still "iffy": in order to prevent JIT
> > > spray attacks, each of these filters is offset by some random number
> > > of uint32_t's, making every 4-byte shift of the filter a "valid"
> > > codepage to be loaded at runtime.
> >
> > So, let's start here. What I've seen from the thread is that there isn't
> > a way to verify that a given JIT matches the cBPF. Is validating the
> > cBPF itself also needed?
>
> Yes(ish), but mostly no. Current kernel exploits, from what I have seen
> and what is readily available, consist of three stages:
>
> - Find a UAF
> - Bootstrap this UAF into an unconstrained read/write
> - Modify some core kernel resource to get arbitrary execution.
>
> Example dating back to 2019:
> https://googleprojectzero.blogspot.com/2019/11/bad-binder-android-in-wild-exploit.html
>
> An adversary could modify the cBPF program prior to loading in order to,
> say, change the range of syscall _NR_'s accepted by the seccomp filter's
> switch statement and use that to stage their escape from Chrome's
> sandbox (a minimal example of such a filter is sketched further below).
>
> However, JIT presents a more general issue, hence the "mostly no": an
> exploited native system service could target the JITed code page in
> order to exploit the kernel, rather than needing to stage something
> inside the modified seccomp sandbox as in the "cBPF itself" example.
>
> For example, Motorola (as well as QCOM) has a few system services for
> hardware and other things, written in C, such as our native dropbox
> agent. Supposing there were an exploit for this agent allowing execution
> within that service's context, an adversary could find a UAF and target
> the page of Chrome's JITed seccomp filter in order to exploit the full
> kernel. That is, they are not worried about escaping the sandbox so much
> as finding a writable resource from which they can gain privileges in
> the rest of the kernel.
>
> Admittedly, there are ~29,000 other writable data structures (in
> msm-kernel) they could also target, but the JIT'ed seccomp filter is the
> only code page they could modify (since it is not possible to get
> compile-time provenance/signatures). The dilemma is that, as opposed to
> modifying, say, the system_unbound_wq and adding an entry to it that
> holds a pointer to call_usermodehelper_exec_work, you could add code to
> this page instead, leaving the kernel just as exploitable.
>
> The goal at the end of the day is to fix this and then try to build a
> system to lock down the rest of the data in a sensible way. Likely an
> ARM-MTE-like, EL2-maintained tag system conditioned on the kernel's
> scheduler and memory allocation infrastructure. At least, that is what I
> want to be working on after I figure out this seccomp stuff.
>
> > - The IMA subsystem has wanted a way to measure (and validate) seccomp
> >   filters. We could get more details from them for defining this need
> >   more clearly.
>
> You are right. I have added Mimi, Dmitry, and the integrity list. Their
> work with linked lists and other data structures is right in line with
> these concerns. I do not know if they have looked at building verifiers
> for JIT'ed cBPF pages already.
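>
> For the integrity folks' context, the filters in question are short
> classic-BPF programs keyed on seccomp_data->nr. As a purely
> illustrative, untested userspace sketch (not one of the filters in this
> thread; __NR_EXAMPLE_LIMIT and install_example_filter are made-up
> names), the syscall-number range check looks roughly like:
>
> #include <stddef.h>
> #include <linux/filter.h>
> #include <linux/seccomp.h>
> #include <sys/prctl.h>
>
> /* Allow syscall numbers below __NR_EXAMPLE_LIMIT, kill everything else.
>  * Real filters also check seccomp_data->arch before trusting nr. */
> #define __NR_EXAMPLE_LIMIT 256
>
> static struct sock_filter filter[] = {
> 	/* A = seccomp_data->nr */
> 	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
> 	/* if (A >= __NR_EXAMPLE_LIMIT) jump to the kill return */
> 	BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, __NR_EXAMPLE_LIMIT, 1, 0),
> 	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
> 	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
> };
>
> static struct sock_fprog prog = {
> 	.len = sizeof(filter) / sizeof(filter[0]),
> 	.filter = filter,
> };
>
> int install_example_filter(void)
> {
> 	/* Required to use SECCOMP_MODE_FILTER without CAP_SYS_ADMIN */
> 	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
> 		return -1;
> 	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
> }
>
> Widening the __NR_EXAMPLE_LIMIT comparison, or the equivalent immediate
> in the JITed copy of it, is all the "modify the cBPF itself" attack
> described above needs.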
>
> > - The JIT needs to be verified against the cBPF that it was generated
> >   from. We currently do only a single pass and don't validate it once
> >   the region has been set read-only. We have a standing feature request
> >   for improving this: https://github.com/KSPP/linux/issues/154
>
> Kees, this is exactly what I'm talking about, you are awesome!
>
> I'll share the (pretty straightforward) EL2 logic for this, though not
> the code, since licensing and all that, but this public mailing list
> should hopefully serve as prior art against any questionable chipset
> vendor attempting to patent public-domain security for the everyday
> person:
>
> - Marking PTEs null is fine.
> - If a new PTE is allocated, mark it PXN atomically, using the EL2
>   permission fault triggered by the page table lockdown (see the
>   GPL-2.0 kernel module below).
> - If a PTE is updated and the PXN bit is switched from 1 to 0, SHA256
>   the page, mark it immutable, and let it through if it is OK.
>
> This lets the page be mucked with during the whole JIT process, but
> ensures that the second the page wants to be priv-executable, no further
> modifications happen. To "unlock" the page for freeing, one just needs
> to set the PXN bit back; if we ever want to execute from it again, the
> process repeats, and so on. This relies on my prior main.c vmalloc
> maintenance and the ptprotect logic below (note: WIP, no warranty on
> this code). A rough EL1-side illustration of the hashing step is
> appended after the module.
>
> > For solutions, I didn't see much discussion around the "orig_prog"
> > copy of the cBPF. Under CHECKPOINT_RESTORE, the original cBPF remains
> > associated with the JIT. struct seccomp_filter's struct bpf_prog prog's
> > orig_prog member. If it has value outside of CHECKPOINT_RESTORE, then
> > we could do it for those conditions too.
>
> Unfortunately, the Android GKI does not support checkpoint/restore,
> which makes the orig_prog reference fail (at least in the cell-phone
> case I am trying to work towards).
>
> I could lock orig_prog as immutable during the JIT and then, given the
> resulting code page, attempt to reproduce that page in EL2 from the
> original cBPF, but that seems dangerous and potentially buggy compared
> to checking the reference addresses in the final machine code against
> knowledge of struct seccomp_data (what I am working on right now).
>
> Maxwell
>
> // SPDX-License-Identifier: GPL-2.0
> /*
>  * Copyright (C) 2023 Motorola Mobility, Inc.
>  *
>  * Authors: Maxwell Bland
>  *          Binsheng "Sammy" Que
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License version 2 as
>  * published by the Free Software Foundation.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>  * GNU General Public License for more details.
>  *
>  * Initializes hypervisor-level protections for the kernel page tables,
>  * in coordination with the moto_org_mem driver, which restricts
>  * executable code pages to a well-defined region between
>  *
>  *   stext <-> module_alloc_base + SZ_2G
>  *
>  * It marks all page tables not corresponding to this virtual address
>  * range as PXNTable and marks the tables those descriptors live in as
>  * immutable.
>  * Tables/descriptors which are marked privileged-executable are marked
>  * permanently immutable, and modifications to them are tracked
>  * directly.
>  */
> #ifndef _PTPROTECT_H
> #define _PTPROTECT_H
>
> #include <linux/delay.h>
> #include <linux/highmem.h>
> #include <linux/kprobes.h>
> #include <linux/list.h>
> #include <linux/mm_types.h>
> #include <linux/module.h>
> #include <linux/of.h>
> #include <linux/of_platform.h>
> #include <linux/pagewalk.h>
> #include <linux/types.h>
> #include <asm/pgalloc.h>
> #include <asm/pgtable-hwdef.h>
> #include <asm/pgtable.h>
> #include <mm/pgalloc-track.h>
> #include <trace/hooks/fault.h>
> #include <trace/hooks/vendor_hooks.h>
> #include <fs/erofs/compress.h>
>
> uint64_t stext_vaddr = 0;
> uint64_t etext_vaddr = 0;
> uint64_t module_alloc_base_vaddr = 0;
>
> uint64_t last_pmd_range[2] = { 0, 0 };
> uint64_t pmd_range_list[1024][2] = { 0 };
> int pmd_range_list_index = 0;
>
> /**
>  * add_to_pmd_range_list - adds a range to the pmd range list
>  * @start: Start of the range
>  * @end: End of the range
>  *
>  * Maintains a naive set of adjacent pmd segments to speed up the
>  * protection code, as otherwise we would treat each pmd (and there are
>  * a lot of them) as a separate region to protect.
>  */
> static void add_to_pmd_range_list(uint64_t start, uint64_t end)
> {
> 	pmd_range_list[pmd_range_list_index][0] = start;
> 	pmd_range_list[pmd_range_list_index][1] = end;
> 	pmd_range_list_index++;
> }
>
> void lock_last_pmd_range(void)
> {
> 	if (last_pmd_range[0] == 0 || last_pmd_range[1] == 0)
> 		return;
> 	split_block(last_pmd_range[0]);
> 	mark_range_ro_smc(last_pmd_range[0], last_pmd_range[1],
> 			  KERN_PROT_PAGE_TABLE);
> 	msleep(10);
> }
>
> /**
>  * prot_pmd_entry - protects a range pointed to by a pmd entry
>  * @pmd: Pointer to the pmd entry
>  * @addr: Virtual address mapped by the pmd entry
>  */
> static void prot_pmd_entry(pmd_t *pmd, unsigned long addr)
> {
> 	uint64_t pgaddr = pmd_page_vaddr(*pmd);
> 	uint64_t start_range = 0;
> 	uint64_t end_range = 0;
>
> 	/*
> 	 * Just found that QCOM's gic_intr_routing.c kernel module is getting
> 	 * allocated at vaddr ffffffdb87f67000, but the module code region
> 	 * should only be allocated from ffffffdb8fc00000 to ffffffdc0fdfffff...
> 	 *
> 	 * It seems to be because arm64's module.h defines module_alloc_base as
> 	 * ((u64)_etext - MODULES_VSIZE). This module_alloc_base should be
> 	 * redefined/randomized by kernel/kaslr.c; however, it appears that
> 	 * early init modules get allocated before module_alloc_base is
> 	 * relocated, so c'est la vie, and the efforts of kaslr.c are for
> 	 * naught (_etext's vaddr is randomized though, so it does not
> 	 * matter, I guess).
> 	 */
> 	uint64_t module_alloc_start = module_alloc_base_vaddr;
> 	uint64_t module_alloc_end = module_alloc_base_vaddr + SZ_2G;
>
> 	if (!pmd_present(*pmd) || pmd_bad(*pmd) || pmd_none(*pmd) ||
> 	    !pmd_val(*pmd))
> 		return;
>
> 	/* Round the starts and ends of each region to their boundary limits */
> 	// module_alloc_start -= (module_alloc_start % PMD_SIZE);
> 	// module_alloc_end += PMD_SIZE - (module_alloc_end % PMD_SIZE) - 1;
>
> 	start_range = __virt_to_phys(pgaddr);
> 	end_range = __virt_to_phys(pgaddr) + sizeof(pte_t) * PTRS_PER_PMD - 1;
>
> 	/* If the PMD potentially points to code, check it in the hypervisor */
> 	if (!pmd_leaf(*pmd) &&
> 	    ((addr <= etext_vaddr && (addr + PMD_SIZE - 1) >= stext_vaddr) ||
> 	     (addr <= module_alloc_end &&
> 	      (addr + PMD_SIZE - 1) >= module_alloc_start))) {
> 		if (start_range == last_pmd_range[1] + 1) {
> 			last_pmd_range[1] = end_range;
> 		} else if (end_range + 1 == last_pmd_range[0]) {
> 			last_pmd_range[0] = start_range;
> 		} else if (last_pmd_range[0] == 0 && last_pmd_range[1] == 0) {
> 			last_pmd_range[0] = start_range;
> 			last_pmd_range[1] = end_range;
> 		} else {
> 			add_to_pmd_range_list(last_pmd_range[0],
> 					      last_pmd_range[1]);
> 			lock_last_pmd_range();
> 			last_pmd_range[0] = start_range;
> 			last_pmd_range[1] = end_range;
> 		}
> 	/* If the PMD points to data only, mark it PXN; the caller will
> 	 * mark the PMD immutable after this function returns */
> 	} else {
> 		if (!pmd_leaf(*pmd)) {
> 			set_pmd(pmd, __pmd(pmd_val(*pmd) | PMD_TABLE_PXN));
> 		} else {
> 			/* TODO: if block, ensure range is marked immutable */
> 			pr_info("MotoRKP: pmd block at %llx\n", start_range);
> 		}
> 	}
> }
>
> pgd_t *swapper_pg_dir_ind;
> void (*set_swapper_pgd_ind)(pgd_t *pgdp, pgd_t pgd);
>
> static inline bool in_swapper_pgdir_ind(void *addr)
> {
> 	return ((unsigned long)addr & PAGE_MASK) ==
> 	       ((unsigned long)swapper_pg_dir_ind & PAGE_MASK);
> }
>
> static inline void set_pgd_ind(pgd_t *pgdp, pgd_t pgd)
> {
> 	if (in_swapper_pgdir_ind(pgdp)) {
> 		set_swapper_pgd_ind(pgdp, __pgd(pgd_val(pgd)));
> 		return;
> 	}
>
> 	WRITE_ONCE(*pgdp, pgd);
> 	dsb(ishst);
> 	isb();
> }
>
> /**
>  * prot_pgd_entry - protects a range pointed to by a pgd entry
>  * @pgd: pgd struct with descriptor values
>  * @addr: vaddr of the start of the pgd's referenced memory range
>  * @next: end vaddr of the walk range
>  * @walk: page table walk state (unused)
>  */
> static int prot_pgd_entry(pgd_t *pgd, unsigned long addr, unsigned long next,
> 			  struct mm_walk *walk)
> {
> 	uint64_t pgaddr = pgd_page_vaddr(*pgd);
> 	uint64_t start_range = 0;
> 	uint64_t end_range = 0;
> 	uint64_t module_alloc_start = module_alloc_base_vaddr;
> 	uint64_t module_alloc_end = module_alloc_base_vaddr + SZ_2G;
> 	uint64_t i = 0;
> 	pmd_t *subdescriptor = NULL;
> 	unsigned long subdescriptor_addr = addr;
>
> 	if (!pgd_present(*pgd) || pgd_bad(*pgd) || pgd_none(*pgd) ||
> 	    !pgd_val(*pgd))
> 		return 0;
>
> 	/* Round the starts and ends of each region to their boundary limits */
> 	// module_alloc_start -= (module_alloc_start % PGDIR_SIZE);
> 	// module_alloc_end += PGDIR_SIZE - (module_alloc_end % PGDIR_SIZE) - 1;
>
> 	if (!pgd_leaf(*pgd)) {
> 		start_range = __virt_to_phys(pgaddr);
> 		end_range = __virt_to_phys(pgaddr) +
> 			    sizeof(p4d_t) * PTRS_PER_PGD - 1;
>
> 		/* If the PGD covers addresses between stext_vaddr and
> 		 * etext_vaddr or between module_alloc_base and
> 		 * module_alloc_base + SZ_2G, do not mark it PXN */
> 		if ((addr <= etext_vaddr &&
> 		     (addr + PGDIR_SIZE - 1) >= stext_vaddr) ||
> 		    (addr <= module_alloc_end &&
> 		     (addr + PGDIR_SIZE - 1) >= module_alloc_start)) {
> 			/* Protect all second-level PMD entries */
> 			for (i = 0; i < PTRS_PER_PGD; i++) {
> 				subdescriptor = (pmd_t *)(pgaddr +
> 						i * sizeof(pmd_t));
> 				prot_pmd_entry(subdescriptor,
> 					       subdescriptor_addr);
> 				subdescriptor_addr += PMD_SIZE;
> 			}
> 			lock_last_pmd_range();
>
> 			split_block(start_range);
> 			mark_range_ro_smc(start_range, end_range,
> 					  KERN_PROT_PAGE_TABLE);
> 		} else {
> 			/* Further modifications are protected by immutability
> 			 * from hyp_rodata_end to __inittext_begin in kickoff.
> 			 * Bit 59 is the PXNTable bit of a table descriptor. */
> 			set_pgd_ind(pgd, __pgd(pgd_val(*pgd) | 1UL << 59));
> 		}
> 	} else {
> 		/* TODO: Handle block case at this level? */
> 		pr_info("MotoRKP: pgd block at %llx\n", start_range);
> 	}
> 	return 0;
> }
>
> /*
>  * Locks down the ranges of memory pointed to by all PGDs as read-only.
>  * Current kernel configurations do not bother with p4ds or puds, and
>  * thus we do not need protections for these layers (the pgd points
>  * directly to the pmd).
>  */
> static const struct mm_walk_ops protect_pgds = {
> 	.pgd_entry = prot_pgd_entry,
> };
>
> #endif /* _PTPROTECT_H */
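>
> For illustration of the SHA256 step in the list above, an EL1-side
> sketch using the kernel crypto API would look roughly like the
> following (minimal and untested, not the EL2 implementation;
> sha256_exec_page is a made-up name):
>
> #include <crypto/hash.h>
> #include <crypto/sha2.h> /* crypto/sha.h on older kernels */
> #include <linux/err.h>
> #include <linux/mm.h>
> #include <linux/slab.h>
>
> /*
>  * sha256_exec_page - hash one page whose PXN bit is about to be cleared
>  * @vaddr: kernel virtual address of the page
>  * @digest: out buffer of SHA256_DIGEST_SIZE bytes
>  *
>  * The resulting digest would be compared against an allowed measurement
>  * before the page is marked immutable and executable.
>  */
> static int sha256_exec_page(const void *vaddr, u8 digest[SHA256_DIGEST_SIZE])
> {
> 	struct crypto_shash *tfm;
> 	struct shash_desc *desc;
> 	int ret;
>
> 	tfm = crypto_alloc_shash("sha256", 0, 0);
> 	if (IS_ERR(tfm))
> 		return PTR_ERR(tfm);
>
> 	desc = kmalloc(sizeof(*desc) + crypto_shash_descsize(tfm), GFP_KERNEL);
> 	if (!desc) {
> 		crypto_free_shash(tfm);
> 		return -ENOMEM;
> 	}
> 	desc->tfm = tfm;
>
> 	ret = crypto_shash_digest(desc, vaddr, PAGE_SIZE, digest);
>
> 	kfree(desc);
> 	crypto_free_shash(tfm);
> 	return ret;
> }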