> From: Mark Rutland <mark.rutland@xxxxxxx>
> On Tue, Feb 13, 2024 at 10:05:45AM -0600, Maxwell Bland wrote:
> > VMALLOC_START ffff800080000000
> > VMALLOC_END   fffffbfff0000000
> > _text         ffffb6c0c1400000
> > _end          ffffb6c0c3e40000
> >
> > Setting VMALLOC_END to _text in init would resolve this issue with the
> > caveat of a sizeable reduction in the size of available vmalloc memory due
> > to requirements on aslr randomness. However, there are circumstances where
> > this trade-off is necessary: in particular, hypervisor-level security
> > monitors where 1) the microarchitecture contains race conditions on PTE
> > level updates or 2) a per-PTE update verifier comes at a significant hit to
> > performance.
>
> Which "hypervisor-level security monitors" are you referring to?

Right now there are around 4 or 5 different attempts (from what I know: Moto,
Samsung, MediaTek, and Qualcomm) at making page tables immutable and reducing
the kernel threat surface to just dynamically allocated structs, e.g.
file_operations, in ARM, a revival of some of the ideas of:

https://wenboshen.org/publications/papers/tz-rkp-ccs14.pdf

Which are no longer possible to enforce for a number of reasons. As related to
this patch in particular: the performance hits involved in per-PTE update
verification are huge.

My goal is ultimately to prevent modern exploits like:

https://github.com/chompie1337/s8_2019_2215_poc

which modify dynamically allocated pointers, but trying to protect against
these exploits is disingenuous without first being able to enforce PXN on
non-code pages, i.e. there is a reason we do this in mm initialization, but we
need to enforce or support the enforcement of PXNTable dynamically too.

> We don't support any of those upstream AFAIK.

As is hopefully apparent from the above, though it will help downstream
systems, I do not see this patch as a support issue so much as a legitimate
security feature. There is the matter of deciding which subsystem should be
responsible.
The generic vmalloc interface should provide a strong distinction between code
and data allocations, but enforcing this would become the responsibility of
each microarchitecture regardless.

> How much VA space are you potentially throwing away?

This is rough, I admit. )-: On the order of 70,000 GB, likely more in
practice: it restricts vmalloc to the region before _text. You may be
thinking, "that is ridiculous, c'mon Maxwell", and you would be right, but I
was OK with this trade-off for Moto systems, and was thinking the approach
keeps the patch changes small and simple.

I had a hard time thinking of a better way to do this while avoiding
duplication of vmalloc code into arm64 land. Potentially, though, it would be
OK to add an additional field to the generic vmalloc interface? I may need to
reach out for help here: maybe the solution to the issue will come more
readily to those with more experience.

> How does this work with other allocations of executable memory? e.g. modules,
> BPF?

It should work.

- arch/arm64/kernel/module.c uses __vmalloc_node_range with module_alloc_base
  and module_alloc_end, bypassing the generic vmalloc_node region, and these
  variables are decided based on a random offset between _text and _end.
- kernel/bpf/core.c uses bpf_jit_alloc_exec to create executable code regions,
  which is a wrapper for module_alloc. In the interpreted BPF case, we do not
  need to worry since the pages storing interpreted code are NX and can be
  marked PXNTable regardless.

> I'm not keen on this as-is.

That's OK, so long as we agree enforcing PXNTable dynamically would be a good
thing. I look forward to your thoughts on the above, and I will go back and
iterate.

Working with IT to fix the email formatting now, so I will hopefully be able
to post a fetchable and runnable version of my initial patch shortly.
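
As a rough sanity check on the VA-space figure above, the arithmetic can be
sketched in a few lines of userspace C. The addresses are the ones from the
quoted boot log (a single KASLR boot, so they will differ between boots); the
EX_* macros and the function names are made up for illustration and are not
kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Example addresses taken from the boot log quoted in this thread. */
#define EX_VMALLOC_START 0xffff800080000000ULL
#define EX_VMALLOC_END   0xfffffbfff0000000ULL
#define EX_TEXT          0xffffb6c0c1400000ULL

#define BYTES_PER_GIB (1ULL << 30)

/* Size in bytes of the half-open range [lo, hi). */
uint64_t span_bytes(uint64_t lo, uint64_t hi)
{
    assert(hi >= lo);
    return hi - lo;
}

/* Whole GiB contained in [lo, hi), rounded down. */
uint64_t span_gib(uint64_t lo, uint64_t hi)
{
    return span_bytes(lo, hi) / BYTES_PER_GIB;
}

/* vmalloc space given up if VMALLOC_END is clamped down to _text. */
uint64_t lost_gib(void)
{
    return span_gib(EX_TEXT, EX_VMALLOC_END);
}

/* vmalloc space that remains between VMALLOC_START and _text. */
uint64_t remaining_gib(void)
{
    return span_gib(EX_VMALLOC_START, EX_TEXT);
}
```

For this particular boot, lost_gib() works out to 70908 and remaining_gib()
to 56065, which is where the "on the order of 70,000 GB" estimate comes from.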