Overview
========
This RFC demonstrates an implementation of Address Space Isolation
(ASI), similar to Junaid Shahid’s proposal from 2022 [1].

Until now, mitigating hardware vulnerabilities has required one or both
of:

- Highly custom mitigations being developed under pressure for every
  specific exploit,

- Prohibitive performance penalties.

ASI is an attempt to improve both of these points by providing a single
technique that mitigates a very broad class of vulnerabilities while
still achieving a tolerable performance overhead.

The basic idea is to run the kernel in a “restricted address space”,
where any page that could contain “sensitive” data is unmapped. When the
kernel needs to access such data, a page fault occurs, in which we
switch back to the normal (“unrestricted”) address space and perform
vulnerability mitigations. Before returning to potentially malicious
code (VM guest/userspace) we transition back into the restricted address
space and get a chance to perform additional mitigations.

Thus, we only pay the cost of security mitigations for kernel entries
(such as VM Exit) that actually access sensitive data. If we can arrange
for these accesses to be infrequent, it becomes viable to perform
aggressive mitigations on address space transitions. For example, in
this RFC we attempt to obliterate indirect branch predictor training,
without needing to concern ourselves too much with microarchitectural
details of specific exploits.

My talk at LSF/MM/BPF this year [2] has some additional conceptual
introduction with diagrams etc, plus some more detailed discussion of
the strategic pros and cons of ASI. Junaid’s RFC cover letter [1] has
some additional discussion too, I won’t rehash it in detail.

Like Junaid’s RFC, this only implements ASI for protecting against
malicious KVM guests; this is a somewhat simpler use-case to start with.
However, ASI is written as a framework so that we can later use it to
sandbox bare metal processes too.
Work has begun on prototyping this but we don’t have a working
implementation yet.

Rough structure of this series:

- 01-14: Establish ASI infrastructure, e.g. for manipulating pagetables,
  performing address space transitions.
- 15-19: Map data into the restricted address space.
- 20-23: Finalize a functionally correct ASI for KVM.
- 24-26: Switch it on and demonstrate actual vuln mitigation.

What’s new in this RFC?
=======================
Since Junaid’s initial efforts, Google has steadily invested more and
more deeply towards ASI as a keystone of hardware security.

This RFC is basically the same system that Junaid presented, but I’ve
done my best to shrink it as much as possible. So, this is really just
enough to demonstrate ASI working end-to-end. The most radical
simplifications are the removal of “local nonsensitive” memory (see [1]
for explanation) and the removal of all of the TLB-flushing smarts.
Those will be implemented later as an enhancement.

What’s needed to make this a PATCH?
===================================

.:: Major problems

Aside from general missing features and performance issues there are two
major problems with this patchset:

1. It adds a page flag.

2. It creates artificial OOM conditions.

See “mm: asi: Map non-user buddy allocations as nonsensitive” for
details of both problems. I hope to solve these with a more intrusive
but less hacky integration into the buddy allocator. This was discussed
at LSF/MM/BPF [2], so I won’t go into detail here; I just failed to get a
prototype ready in time for this RFC. I’ll need to have one ready before
I can reasonably ask to merge anything.

It remains an open question whether we can find a way to merge a minimal
ASI without that complex integration, without creating technical debt
such as a page flag.

.:: Configuration

As well as the above, I think it needs a cleaner idea of how ASI should
be configured.
In this RFC, it’s enabled by setting asi=on on the kernel command-line,
and has barely any interaction with bugs.c. ASI does not trivially fit
into the existing configuration mechanism:

a. Existing mitigations are generally configured per-vuln, while ASI is
   not a per-vuln mitigation.

b. ASI will never be strictly equivalent to any other mitigation
   configuration (because it deliberately drops protection for at least
   some memory), so making it the default represents a moderately bold
   policy decision.

ASI also warrants configuration beyond on/off. Because it provides a way
to avoid paying mitigation cost most of the time, in my opinion ASI is
best used in a mode that mitigates exploits beyond those that are
currently known to be possible on a given platform. For example, in this
RFC we attempt to obliterate _all_ indirect branch predictor training
before leaving the restricted address space, even on platforms where no
practical exploit is known to necessitate this. But I expect many users
to reject this philosophy, and the kernel ought to support a different
policy.

Input on this topic would be appreciated - even if it feels like
bikeshedding, I think it’s likely to provoke more interesting discussion
as a side effect. Otherwise I’ll just come up with _something_ and we
can discuss more at [PATCH] time. Perhaps a simple starting point would
be “mitigations=asi”.

.:: Minor issues

- KVM’s rseq_test fails with asi=on. I think this is “just” a
  performance problem; KVM rseq logic is known to trigger ASI
  transitions without additional optimisations that will be explored for
  a later series.

- fill_return_buffer() causes an “unreachable instruction” objtool
  warning. I haven’t investigated this.

- Some BUGs that should probably not crash the kernel.

What is “sensitive memory”?
===========================
ASI is fundamentally creating a new security boundary. So, where does
the boundary go? In other words, what gets mapped into the restricted
address space?
This is determined at allocation time. In this RFC, there is a new
__GFP_SENSITIVE flag (currently only supported for buddy allocations,
not slab), and everything else is considered non-sensitive. This
default-nonsensitive approach is known as a “denylist” model.

By simply adding __GFP_SENSITIVE to GFP_USER, we can already deliver
significant protection from real-world attacks, while already being
within reach of pretty high performance results (more on this later).
However, it’s obviously not the case that all data worth leaking is
always in GFP_USER pages. There are two ways to respond to this problem:

1. Expand the denylist, i.e. try to set __GFP_SENSITIVE for all memory
   that can contain secrets.

2. Switch to an “allowlist” model where sensitive is the default. Then
   our job would instead be to set __GFP_NONSENSITIVE wherever we can
   determine it’s safe and worthwhile for performance.

Option 2 clearly puts us in a stronger security posture, but it has the
major disadvantage of risking unpredictable performance impacts: since
ASI transitions are costly, a random system change that causes new pages
to start being touched by the kernel is much more likely to create
sudden, hard-to-diagnose performance degradations. This makes switching
ASI on in production a much scarier proposition.

Opinions at LSF/MM/BPF were surprisingly relaxed about this topic. So if
possible I’d like to prefer option 1, and focus on getting Linux as soon
as possible to a version of ASI that’s viable to run in production, and
from there iterate towards stronger security guarantees. However,
discussion is welcome.

Performance
===========
I’m a little embarrassed that I don’t have performance data with this
RFC. Progress on getting this data has been painful, so I decided to
just get discussion started on the implementation, and I hope to follow
up soon with data.

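
In the absence of real numbers, the relationship between the denylist
model and transition cost can at least be illustrated with a toy
userspace model. This is only a sketch and is not code from this series:
every toy_* name and TOY_GFP_* value below is hypothetical, with
TOY_GFP_SENSITIVE merely standing in for __GFP_SENSITIVE. It counts how
many simulated kernel entries have to leave the restricted address
space, which is where the expensive mitigations would run.

```c
#include <stdbool.h>

/* Hypothetical flags for this sketch only, not the kernel's GFP values. */
#define TOY_GFP_SENSITIVE 0x1u
#define TOY_GFP_KERNEL    0x0u
#define TOY_GFP_USER      (0x2u | TOY_GFP_SENSITIVE)

struct toy_page {
	bool sensitive;		/* decided once, at allocation time */
};

struct toy_cpu {
	bool restricted;	/* in the restricted address space? */
	unsigned int exits;	/* restricted -> unrestricted transitions */
};

/* Denylist model: a page is nonsensitive unless the caller marks it. */
static struct toy_page toy_alloc(unsigned int gfp)
{
	struct toy_page page = { .sensitive = (gfp & TOY_GFP_SENSITIVE) != 0 };

	return page;
}

/* Models entering the restricted address space at the start of an entry. */
static void toy_enter(struct toy_cpu *cpu)
{
	cpu->restricted = true;
}

/*
 * Models a kernel access. Touching a sensitive page while restricted
 * stands in for the page fault that switches to the unrestricted
 * address space.
 */
static void toy_access(struct toy_cpu *cpu, const struct toy_page *page)
{
	if (cpu->restricted && page->sensitive) {
		cpu->restricted = false;
		cpu->exits++;
	}
}

/*
 * Simulates `entries` kernel entries. Every entry touches nonsensitive
 * kernel data; every `period`-th entry also touches user memory. A
 * period of 0 means user memory is never touched.
 */
static unsigned int toy_count_exits(unsigned int entries, unsigned int period)
{
	struct toy_cpu cpu = { .restricted = false, .exits = 0 };
	struct toy_page kdata = toy_alloc(TOY_GFP_KERNEL);
	struct toy_page udata = toy_alloc(TOY_GFP_USER);

	for (unsigned int i = 0; i < entries; i++) {
		toy_enter(&cpu);
		toy_access(&cpu, &kdata);
		if (period && i % period == 0)
			toy_access(&cpu, &udata);
	}
	return cpu.exits;
}
```

In this model toy_count_exits(1000, 100) returns 10, while
toy_count_exits(1000, 1) returns 1000: the transition cost scales with
how often sensitive pages are touched, not with the number of kernel
entries. It says nothing about the real cost of a transition.
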
Since the initial patchset I’ll be proposing to merge will be minimal
(something similar in scope to this RFC), we should expect it to perform
badly. So, I’ll need to put together a forward-looking branch that
includes that patchset plus additional features from future patchsets,
so that we can prove that good performance is achievable longer-term.

Google’s internal version of ASI shows less than 5% degradation on all
end-to-end performance metrics; less than 1% is common. However, for
some workloads this has required more advanced optimisations than those
I expect to post in the initial upstream branch, so we can expect a
worse degradation in some cases.

The branch that I published for LSF/MM/BPF [2] (not radically different
from this RFC) showed comparable performance to Safe RET for a single-VM
Redis benchmark (<5%), although this was not a rigorous analysis. See
[5] for a graph showing that ASI performs dramatically better than a
comparable blanket mitigation (IBPB on VM Exit).

I’m planning to try and run either the VM-supported workloads from
mmtests [3], or some set of workloads from PerfKit Benchmarker [4],
whichever turns out to be easiest. I’ll compare ASI against
mitigations=off and one or two example configurations for existing
mitigations. Let me know if you have any specific requests/suggestions
for workloads or baseline-comparisons.

What’s next?
============
This cover letter is getting rather long, but briefly here are some work
items that need to be done for a “complete ASI”, but which I’d like to
defer until infrastructure is already in place in-tree:

- More sensitivity annotations, which will require more allocator
  integrations

- More advanced/flexible mitigations in address space transitions

- Support for sandboxing bare-metal processes

- Avoid address space transitions by expanding the scope of what can be
  run in the restricted address space (e.g.
  context-switching between tasks in the same mm, returning to
  userspace)

- Deferring TLB flushing and using PCID properly

- Preventing cross-SMT attacks by halting sibling hyperthreads

- Non-x86 support (this isn’t prototyped at all, requires research,
  probably a much longer-term topic).

Acknowledgements
================
Thanks to Alexandre Chartre for the initial implementation that inspired
Junaid’s RFC.

Of course thanks to Junaid Shahid and Ofir Weisse for their fantastic
work on the 2022 RFC and Google’s initial internal implementation.

Reiji Watanabe, Yosry Ahmed and Patrick Bellasi are also major
contributors to this effort from Google (you’ll see them attributed in
commit messages too).

Further thanks to Alexandra Sandulescu and Matteo Rizzo who have
provided security expertise for Google’s deployment. Alexandra is also
working on reliable easy-to-run exploit PoCs (as kernel selftests) which
have helped us to gain confidence that ASI actually mitigates
vulnerabilities.

References
==========
[1] Junaid’s RFC:
    https://lore.kernel.org/all/20220223052223.1202152-1-junaids@xxxxxxxxxx/

[2] LSF/MM/BPF: https://www.youtube.com/watch?v=DxaN6X_fdlI
    LWN coverage: https://lwn.net/Articles/974390/
    Code: http://github.com/googleprodkernel/linux-kvm/tree/asi-lsfmmbpf-24

[3] mmtests: https://github.com/gormanm/mmtests

[4] PerfKit Benchmarker:
    https://github.com/GoogleCloudPlatform/PerfKitBenchmarker

[5] Performance data at LSF/MM/BPF (timestamp link):
    https://youtu.be/DxaN6X_fdlI?t=557

To: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
To: Ingo Molnar <mingo@xxxxxxxxxx>
To: Borislav Petkov <bp@xxxxxxxxx>
To: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
To: H. Peter Anvin <hpa@xxxxxxxxx>
To: Andy Lutomirski <luto@xxxxxxxxxx>
To: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
To: Sean Christopherson <seanjc@xxxxxxxxxx>
To: Paolo Bonzini <pbonzini@xxxxxxxxxx>
To: Alexandre Chartre <alexandre.chartre@xxxxxxxxxx>
To: Liran Alon <liran.alon@xxxxxxxxxx>
To: Jan Setje-Eilers <jan.setjeeilers@xxxxxxxxxx>
To: Catalin Marinas <catalin.marinas@xxxxxxx>
To: Will Deacon <will@xxxxxxxxxx>
To: Mark Rutland <mark.rutland@xxxxxxx>
To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
To: Mel Gorman <mgorman@xxxxxxx>
To: Lorenzo Stoakes <lstoakes@xxxxxxxxx>
To: David Hildenbrand <david@xxxxxxxxxx>
To: Vlastimil Babka <vbabka@xxxxxxx>
To: Michal Hocko <mhocko@xxxxxxxxxx>
To: Khalid Aziz <khalid.aziz@xxxxxxxxxx>
To: Juri Lelli <juri.lelli@xxxxxxxxxx>
To: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
To: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
To: Steven Rostedt <rostedt@xxxxxxxxxxx>
To: Valentin Schneider <vschneid@xxxxxxxxxx>
To: Paul Turner <pjt@xxxxxxxxxx>
To: Reiji Watanabe <reijiw@xxxxxxxxxx>
To: Junaid Shahid <junaids@xxxxxxxxxx>
To: Ofir Weisse <oweisse@xxxxxxxxxx>
To: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
To: Patrick Bellasi <derkling@xxxxxxxxxx>
To: KP Singh <kpsingh@xxxxxxxxxx>
To: Alexandra Sandulescu <aesa@xxxxxxxxxx>
To: Matteo Rizzo <matteorizzo@xxxxxxxxxx>
To: Jann Horn <jannh@xxxxxxxxxx>
Cc: x86@xxxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: linux-mm@xxxxxxxxx
Cc: kvm@xxxxxxxxxxxxxxx

Signed-off-by: Brendan Jackman <jackmanb@xxxxxxxxxx>

---
Brendan Jackman (15):
      x86: Create CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
      objtool: let some noinstr functions make indirect calls
      mm: asi: Add infrastructure for boot-time enablement
      mm: asi: ASI support in interrupts/exceptions
      mm: asi: Avoid warning from NMI userspace accesses in ASI context
      mm: Add __PAGEFLAG_FALSE
      mm: asi: Map non-user buddy allocations as nonsensitive
      mm: asi: Map kernel text and static data as nonsensitive
      mm: asi: Map vmalloc/vmap data as nonsesnitive
      KVM: x86: asi: Restricted address space for VM execution
      KVM: x86: asi: Stabilize CR3 when potentially accessing with ASI
      mm: asi: Stabilize CR3 in switch_mm_irqs_off()
      mm: asi: Make TLB flushing correct under ASI
      mm: asi: Stop ignoring asi=on cmdline flag
      KVM: x86: asi: Add some mitigations on address space transitions

Junaid Shahid (8):
      mm: asi: Make some utility functions noinstr compatible
      mm: asi: Introduce ASI core API
      mm: asi: Switch to unrestricted address space before a context switch
      mm: asi: Use separate PCIDs for restricted address spaces
      mm: asi: Make __get_current_cr3_fast() ASI-aware
      mm: asi: ASI page table allocation functions
      mm: asi: Functions to map/unmap a memory range into ASI page tables
      mm: asi: Add basic infrastructure for global non-sensitive mappings

Ofir Weisse (1):
      mm: asi: asi_exit() on PF, skip handling if address is accessible

Reiji Watanabe (1):
      mm: asi: Map dynamic percpu memory as nonsensitive

Yosry Ahmed (1):
      percpu: clean up all mappings when pcpu_map_pages() fails

 arch/alpha/include/asm/Kbuild            |    1 +
 arch/arc/include/asm/Kbuild              |    1 +
 arch/arm/include/asm/Kbuild              |    1 +
 arch/arm64/include/asm/Kbuild            |    1 +
 arch/csky/include/asm/Kbuild             |    1 +
 arch/hexagon/include/asm/Kbuild          |    1 +
 arch/loongarch/include/asm/Kbuild        |    1 +
 arch/m68k/include/asm/Kbuild             |    1 +
 arch/microblaze/include/asm/Kbuild       |    1 +
 arch/mips/include/asm/Kbuild             |    1 +
 arch/nios2/include/asm/Kbuild            |    1 +
 arch/openrisc/include/asm/Kbuild         |    1 +
 arch/parisc/include/asm/Kbuild           |    1 +
 arch/powerpc/include/asm/Kbuild          |    1 +
 arch/riscv/include/asm/Kbuild            |    1 +
 arch/s390/include/asm/Kbuild             |    1 +
 arch/sh/include/asm/Kbuild               |    1 +
 arch/sparc/include/asm/Kbuild            |    1 +
 arch/um/include/asm/Kbuild               |    1 +
 arch/x86/Kconfig                         |   27 ++
 arch/x86/include/asm/asi.h               |  267 +++++++++++
 arch/x86/include/asm/cpufeatures.h       |    1 +
 arch/x86/include/asm/disabled-features.h |    8 +-
 arch/x86/include/asm/idtentry.h          |   50 ++-
 arch/x86/include/asm/kvm_host.h          |    5 +
 arch/x86/include/asm/nospec-branch.h     |    2 +
 arch/x86/include/asm/processor.h         |   15 +-
 arch/x86/include/asm/special_insns.h     |    8 +-
 arch/x86/include/asm/tlbflush.h          |    5 +
 arch/x86/kernel/process.c                |    2 +
 arch/x86/kernel/traps.c                  |   22 +
 arch/x86/kvm/svm/svm.c                   |    2 +
 arch/x86/kvm/vmx/nested.c                |    8 +
 arch/x86/kvm/vmx/vmx.c                   |  124 +++--
 arch/x86/kvm/x86.c                       |   60 ++-
 arch/x86/lib/retpoline.S                 |    7 +
 arch/x86/mm/Makefile                     |    1 +
 arch/x86/mm/asi.c                        |  748 +++++++++++++++++++++++++++++++
 arch/x86/mm/fault.c                      |  119 ++++-
 arch/x86/mm/init.c                       |    5 +-
 arch/x86/mm/init_64.c                    |   25 +-
 arch/x86/mm/mm_internal.h                |    3 +
 arch/x86/mm/tlb.c                        |  136 +++++-
 arch/xtensa/include/asm/Kbuild           |    1 +
 include/asm-generic/asi.h                |   84 ++++
 include/asm-generic/vmlinux.lds.h        |   11 +
 include/linux/compiler_types.h           |    8 +
 include/linux/gfp_types.h                |   15 +-
 include/linux/mm_types.h                 |    7 +
 include/linux/page-flags.h               |   16 +
 include/linux/pgtable.h                  |    3 +
 include/trace/events/mmflags.h           |   12 +-
 kernel/fork.c                            |    3 +
 kernel/sched/core.c                      |    3 +
 mm/init-mm.c                             |    4 +
 mm/internal.h                            |    2 +
 mm/page_alloc.c                          |  143 +++++-
 mm/percpu-vm.c                           |   52 ++-
 mm/percpu.c                              |    4 +-
 mm/vmalloc.c                             |   61 ++-
 tools/objtool/check.c                    |   14 +
 tools/perf/builtin-kmem.c                |    1 +
 62 files changed, 1977 insertions(+), 136 deletions(-)
---
base-commit: a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
change-id: 20240524-asi-rfc-24-2ea47c41352d

Best regards,
-- 
Brendan Jackman <jackmanb@xxxxxxxxxx>