Dear Linux Hardening, Security, and Memory Management Mailing Lists,

This is primarily an FYI and an RFC. I have some code, included below, that could be dropped into a *.ko for the 6.1.X kernel, but really this mail is to query about ideas for acceptable upstream changes. Thank you ahead of time for reading!

If the title alone of this email sticks out and makes sense immediately, feel free to skip the introduction below.

INTRODUCTION

For the past few months, I have been sparring with recent CVE PoCs in the kernel, applying monkey patches to dynamic data structure allocations, attempting to prevent data-only attacks which use write gadgets to modify dynamically allocated struct fields otherwise declared constant. I wanted to share, briefly, what I feel is a reasonable and general solution to the standard contemporary exploit procedure.

For those unfamiliar with recent PoCs, see a case study of recent exploits in Man Yue Mo's article here:

https://github.blog/security/vulnerability-research/the-android-kernel-mitigations-obstacle-race/

Particularly, understanding the "Running arbitrary root commands using ret2kworker(TM)" section will give a general idea of the issue.

Summarizing, there are thousands of dynamic data structures alloc'd and free'd in the kernel all the time, for files, for processes, and so forth, and it is elementary to manipulate any instance of data, but hard to protect every single one of them. These range from trng device pointers to kworker queues---everything passing through vmalloc.

The strawman approach presented here is for security engineers to read CVE-XYZ-ABC PoC, identify the portion of the system being manipulated, and patch the allocation handler to protect just that data at the page-table layer, by:

- Reorganizing allocations of those structures so that they are on the same 2MB hugepage, adjacently, as otherwise existing hardware support to prevent their mutation (PTE flags) will trigger for unrelated data allocated adjacently.
- Writing a handler to ensure non-malicious modifications, e.g. keeping "const" fields const, ensuring modifications to other fields happen at the right physical PC values and the right pages, handling atomic updates so that the exception fault on these values maintains ordering under race conditions (maybe "doubling up" on atomic assembly operations due to certain microarch issues at the chipset level, see below), and so on, and so forth.

Eventually, this Sisyphean task amounts to a mountain's worth of point-patches and encoded wisdom, valuable but absurd insofar as there are a thousand more places for an exploit to manipulate instead of the protected ones.

DATATYPE PARTITIONED VIRTUAL MEMORY ALLOCATION

The above process can be generalized by changing Linux's vmalloc to behave more like seL4 (though not identically), by tying allocation itself to the typing of an object:

https://docs.sel4.systems/Tutorials/untyped.html

This is without seL4's caveat that objects should be "allocated in order of size, largest first, to avoid wasting memory."

I demonstrated something similar previously to prevent the intermixed allocation of SECCOMP BPF code pages with data on ARM64's Android Kernel here (with which you may be familiar):

https://lore.kernel.org/all/20240423095843.446565600-1-mbland@xxxxxxxxxxxx/

That said, the above patch does not do the same for other critical dynamically allocated data.

So, for instance, to prevent struct file manipulation, I've written the following code into an init-time loaded kernel (v6.1.x) module:

	filp_cachep_ind = (struct kmem_cache **)kallsyms_lookup_name_ind("filp_cachep");

	/* Just nix the existing file cache for one which is page-aligned */
	*filp_cachep_ind = kmem_cache_create(
		"filp", sizeof(struct file), PAGE_SIZE,
		SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);

I.e. aligning cache allocations to PAGE_SIZE, so that each struct file lands on its own page. See the appendix for associated module code.
Of course, this is a little insane since:

(1) I'm effectively double allocating the cache to change how the structs are allocated, because I can't change the kernel's init process (part of this has to do with Google's GKI).

(2) The kmem infrastructure also needs to be monkey patched so that this "PAGE_SIZE" alignment actually indicates that objects can still be allocated next to each other at the originally set alignment, reducing dead space due to wasted bytes (not implemented).

And, most important,

(3) struct file is just one case of thousands.

However, it seems fine for protecting a specific, given file allocation targeted by something like:

https://github.com/chompie1337/s8_2019_2215_poc/blob/34f6481ed4ed4cff661b50ac465fc73655b82f64/poc/knox_bypass.c#L50

Given you also have the appropriate protection handlers (see appendix below), this works fine even without access to an HVCI system.

Hopefully the above reasoning is clear enough. If so, the proposal would be to pass the data's type itself to kmem_cache_create (and other APIs used to reserve virtual memory for a struct), though the best way to do this with standard C is not clear (maybe some preprocessor magic). kmem_cache_create would then use this type identifier to allocate and resolve a region of virtual memory for just objects of that type.

This is an old idea, and I've found evidence of it in, for example, Levy's discussion of Hydra in 1984's Capability-Based Computer Systems, which contains the following statement regarding object allocations:

"the appropriate list for an object’s fixed part is determined by a hashing function on the object’s 64-bit name"

(though my implication here is that the word "name" should be the 64-bit type). I also don't see much reference there to the hardware page tables and write exception faults, which are the motivation behind the design of such a system.
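To make the "preprocessor magic" idea concrete, here is a minimal userspace sketch, not kernel code, of a cache whose identity is derived from the C type. Everything in it is hypothetical and named only for illustration: struct typed_cache, TYPED_CACHE_INIT, typed_cache_alloc, page_of, and the toy structs. It assumes a simple bump allocator that never frees, with aligned_alloc standing in for the page allocator. The only property it tries to show is that objects of one type pack together at their natural alignment while a page never holds two different types.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical per-type cache: one bump-allocated page at a time. */
struct typed_cache {
	const char *type_name;  /* stringified C type, the "name" of the cache */
	size_t obj_size;
	size_t obj_align;       /* always a power of two for alignof() */
	unsigned char *page;    /* current backing page, NULL until first use */
	size_t used;            /* bytes consumed in the current page */
};

/* The preprocessor derives name, size and alignment from the type itself. */
#define TYPED_CACHE_INIT(type) \
	{ #type, sizeof(type), alignof(type), NULL, 0 }

/* Toy stand-ins; these are NOT the kernel's struct layouts. */
struct toy_file { long f_count; unsigned int f_mode; const void *f_op; };
struct toy_work { void (*func)(void *); void *data; };

static void *typed_cache_alloc(struct typed_cache *c)
{
	/* Round the bump offset up to the object's natural alignment. */
	size_t off = (c->used + c->obj_align - 1) & ~(c->obj_align - 1);

	/* Start a fresh page when none exists or the object will not fit. */
	if (c->page == NULL || off + c->obj_size > SKETCH_PAGE_SIZE) {
		c->page = aligned_alloc(SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
		if (c->page == NULL)
			return NULL;
		off = 0;
	}
	c->used = off + c->obj_size;
	return c->page + off;
}

/* Base address of the page containing p. */
static uintptr_t page_of(const void *p)
{
	return (uintptr_t)p & ~(uintptr_t)(SKETCH_PAGE_SIZE - 1);
}
```

With one cache per type, two toy_file objects share a page while a toy_work object lands on a different one, so a page-granular write protection applied to the file page would not fault on unrelated data. A real implementation would also need freeing, per-CPU behavior, and accounting, none of which is attempted here.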
CONCLUSION

Whatever the implications are, beyond seL4's rough sketch of this idea, I cannot find Type-Partitioned Virtual Memory Allocation coded in many other places. Hopefully, even for those unfamiliar with the exploits in question, the benefits here are clear, as it closes a certain semantic gap between heap allocations and the hardware's ability to protect memory.

Thoughts? I've tried, pretty desperately, to figure out an alternative/easy solution here, but knowing current hardware exception fault handlers, I see few other ways that we will ever have a system to prevent the repercussions of write gadgets.

References? I know of the existing efforts toward HVCI, KASAN, and the KSPP, but hopefully the distinction here is clear enough: I am referring specifically to the pain of adjacency between, for example, f_lock and f_op, and the implications that this has for hardware. From what I understand (very little), even OpenBSD does not do this, though maybe there has been some discussion of it somewhere in https://www.openbsd.org/papers/ ... I found nothing for all those grep-matching "alloc". Please let me know if you've seen anything else discussing this problem, particularly anything that might save me from having to rewrite the virtual memory allocator in our OS to prevent these attacks.

Solutions? I have also been weighing a few other ideas, such as a second page, similar to or built on KASAN, to understand the "allocation map" for a given page. The issue is that this allocation map page, or datatype tag, must then also have a window of writability unless maintained by a hypervisor or otherwise isolated system.

Thank you again for your time in considering this subject, and providing your thoughts in this public forum.
Best Regards,
Maxwell

REFERENCES

The patches/discussions here:

https://lore.kernel.org/all/rsk6wtj2ibtl5yygkxlwq3ibngtt5mwpnpjqsh6vz57lino6rs@rcohctmqugn3/
https://lore.kernel.org/all/994dce8b-08cb-474d-a3eb-3970028752e6@xxxxxxxxxxxxx/
https://lore.kernel.org/all/puj3euv5eafwcx5usqostpohmxgdeq3iout4hqnyk7yt5hcsux@gpiamodhfr54/
https://lore.kernel.org/all/h4hxxozslqmqhwljg5sfold764242pmw5y77mdigaykw5ehjjs@nc4xtzw7xprm/
https://lore.kernel.org/all/20240503131910.307630-1-mic@xxxxxxxxxxx/

PoCs floating around for the following CVEs:

- CVE-2024-1086 (pagetable modification)
- CVE-2021-33909 (seccomp codepage modification)
- CVE-2022-22265 (selinux_enforcing state, AVC cache corruption)
- CVE-2019-2215 (struct file pointer manipulation)
- CVE-2022-22057 (kworker queue manipulation)

Some public talks I've given include additional notes on CFI primitives and other errata (excuse my public speaking skills and ignorance, as this is a developing subject for me):

https://www.youtube.com/watch?v=Rgg01n4jdBU&t=4s&pp=ygUNbWF4d2VsbCBibGFuZA%3D%3D
https://www.youtube.com/watch?v=3DBGardQsHk&t=1844s&pp=ygUNbWF4d2VsbCBibGFuZA%3D%3D

APPENDIX

Below, I'll include a specific example of protecting struct file for the 6.1.x kernel. You'll have to excuse the stylistic and questionable hacks here, since the GKI ensures any useful changes to the kernel need to use the always-on kernel self-patching mechanism.
- Patching File Allocation:

static struct file *alloc_file_handler(const struct path *path, int flags,
				       const struct file_operations *fop)
{
	struct file *file;

	file = alloc_empty_file_ind(flags, current_cred());
	if (IS_ERR(file))
		return file;

	/* TODO: had to expand out the direct struct assignment here
	 * since the snapdragon cannot handle perm faults on stp instructions
	 * with two input registers */
	file->f_path.dentry = path->dentry;
	file->f_path.mnt = path->mnt;

	file->f_inode = path->dentry->d_inode;
	file->f_mapping = path->dentry->d_inode->i_mapping;
	file->f_wb_err = filemap_sample_wb_err(file->f_mapping);
	file->f_sb_err = file_sample_sb_err(file);
	if (fop->llseek)
		file->f_mode |= FMODE_LSEEK;
	if ((file->f_mode & FMODE_READ) && (fop->read || fop->read_iter))
		file->f_mode |= FMODE_CAN_READ;
	if ((file->f_mode & FMODE_WRITE) && (fop->write || fop->write_iter))
		file->f_mode |= FMODE_CAN_WRITE;
	file->f_iocb_flags = iocb_flags(file);
	file->f_mode |= FMODE_OPENED;
	file->f_op = fop;
	if ((file->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
		i_readcount_inc(path->dentry->d_inode);

	/* NOTE/TODO: until the underlying vmalloc infrastructure is
	 * patched or rewritten, it is difficult, if not impossible,
	 * to effectively and efficiently protect all struct file's in the
	 * kernel. The same holds for kworker queues and many other
	 * dynamically allocated data structures. Will message mailing
	 * list about this and maybe continue working on it for the next
	 * decade )-: */
	qcom_smc_waitloop("alloc_file_handler_smc", SMCID_TAG_MEM_PROTECT,
			  __virt_to_phys(file), PAGE_SIZE);

	return file;
}

static void __fput_handler(struct file *file)
{
	struct dentry *dentry = file->f_path.dentry;
	struct vfsmount *mnt = file->f_path.mnt;
	struct inode *inode = file->f_inode;
	fmode_t mode = file->f_mode;
	bool run_dput = true;

	if ((!(file->f_mode & FMODE_OPENED)))
		goto out;

	might_sleep();

	/* Hacks because of QCOM's perm fault handler */
	if (atomic_long_read(&file->f_count) == 0xFFFFFFFFFFFFFFFF)
		return;
	if (atomic_long_read(&file->f_count) == 0x0)
		atomic_long_set(&file->f_count, 0xFFFFFFFFFFFFFFFF);

	fsnotify_close(file);
	/*
	 * The function eventpoll_release() should be the first called
	 * in the file cleanup chain.
	 */
	eventpoll_release_ind(file);
	locks_remove_file_ind(file);

	ima_file_free(file);
	if ((file->f_flags & FASYNC)) {
		if (file->f_op->fasync)
			file->f_op->fasync(-1, file, 0);
	}
	if (file->f_op->release)
		file->f_op->release(inode, file);
	if ((S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
	     !(mode & FMODE_PATH))) {
		cdev_put_ind(inode->i_cdev);
	}
	fops_put(file->f_op);
	put_pid(file->f_owner.pid);
	put_file_access(file);
	if (run_dput)
		dput(dentry);
	if ((mode & FMODE_NEED_UNMOUNT))
		dissolve_on_fput_ind(mnt);
	mntput(mnt);

	qcom_smc_waitloop("__fput_handler_smc", SMCID_TAG_MEM_UNPROTECT,
			  __virt_to_phys(file), PAGE_SIZE);
out:
	file_free(file);
}

On the fault handler side, because the kmem cache allocation places each struct file on a separate page, maintaining the mapping from page to type is pretty easy to resolve via the SMC call:

	if (type == FILE_STRUCT_TYPE) {
		if (ipa % PAGE_SIZE == 0x048) {
			// manage writes to the atomic type/updates according to CASA semantics on ARM64, etc
		}
		if (ipa % PAGE_SIZE == 0x030) {
			// manage writes to the atomic type/updates according to CASA semantics on ARM64, etc
		}
		...
		// prevent writes to f_op, etc, etc, etc
	}
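As an illustration of the offset checks above, here is a minimal userspace sketch that classifies a faulting write by its in-page offset using offsetof rather than hard-coded constants such as 0x048 and 0x030. All names are hypothetical (struct toy_file, enum write_policy, IN_FIELD, toy_file_write_policy), the struct is a toy and does not match the real struct file layout, and the sketch assumes exactly one object per page starting at page offset 0, as the page-aligned cache arranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical stand-in for struct file; NOT the real kernel layout. */
struct toy_file {
	unsigned long f_flags;  /* ordinary mutable field */
	long f_count;           /* stands in for atomic_long_t f_count */
	const void *f_op;       /* set once at allocation, then constant */
};

enum write_policy { WRITE_ALLOW, WRITE_ATOMIC_ONLY, WRITE_DENY };

/* True when byte offset `off` falls inside `field` of `type`. */
#define IN_FIELD(off, type, field) \
	((off) >= offsetof(type, field) && \
	 (off) < offsetof(type, field) + sizeof(((type *)0)->field))

/* Decide how a write fault at intermediate physical address `ipa`
 * should be treated, assuming the object sits at the start of its page. */
static enum write_policy toy_file_write_policy(uint64_t ipa)
{
	size_t off = (size_t)(ipa % SKETCH_PAGE_SIZE);

	if (off >= sizeof(struct toy_file))
		return WRITE_DENY;          /* dead space after the object */
	if (IN_FIELD(off, struct toy_file, f_op))
		return WRITE_DENY;          /* "const after init" pointer */
	if (IN_FIELD(off, struct toy_file, f_count))
		return WRITE_ATOMIC_ONLY;   /* needs emulated atomic update */
	return WRITE_ALLOW;
}
```

Deriving the offsets from the type keeps the handler in step with the structure definition when fields move between kernel versions, which the literal offsets would not. How WRITE_ATOMIC_ONLY is actually emulated is outside what this sketch covers.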