ASI is a technique to mitigate a broad class of CPU vulnerabilities by
unmapping sensitive data from the kernel address space. If no data is
mapped that needs protecting, this class of exploits cannot leak that
data and so the kernel can skip expensive mitigation actions. For a more
detailed overview, see the v1 RFC (which was wrongly labeled as a PATCH)
[0].

This new iteration adds support for protecting against bare-metal
processes as well as KVM guests. The basic principle is unchanged.

.:: Multi-class ASI

So far ASI has been a KVM-only solution, although I've been claiming
that in principle it can be extended to also sandbox userspace. Dave
Hansen's most important feedback at LPC [1] was that he wanted some
evidence to support this claim. If it can be shown that ASI is just as
powerful for bare-metal as for KVM, it's much more likely to actually
offer an escape path from maintaining and reactively developing
per-exploit mitigations.

v1 already supported a notion of "ASI classes", with the only class
being KVM. This RFC introduces a second class for userspace. Each
process has a separate restricted address space ("domain") for each
class.

In v1, the only possible ASI transitions were between the KVM restricted
address space, and the unrestricted address space. Now that there are
multiple classes, it's possible to transition directly between two
restricted address spaces. (Could we dodge this complexity by just
transitioning via the unrestricted address space? Yes, but experience
from Google's internal deployment suggests there's a significant benefit
in avoiding an asi_exit() when switching between userspace and KVM,
despite all the optimizations that exist to avoid that switching).

Compared to v1, this version has a new mechanism to determine what
mitigation actions are required when switching between address spaces.
ASI classes provide a "taint policy" which describes what uarch state
their sandboxee might leave behind, and what uarch state needs to be
purged before their sandboxee can safely be run. The ASI core takes care
of doing the actual flushes.

This enables a reasonably advanced model of what flushes are needed
when; for example the kernel is now able to model "when transitioning
from a VMM to its KVM guest there is no point in flushing speculative
control flow state, but if we _later_ exit to the unrestricted address
space we do need to flush it". It's quite possible this is actually more
advanced than what is needed so suggestions are welcome.

.:: Performance issues: bogus mitigation costs

Although this implementation of ASI is pretty generous in what it
considers "nonsensitive", there remain unnecessary performance costs
that need to be addressed. For example:

- The entire page cache is removed from the direct map. Traditional
  file operations will hit an asi_exit(), paying a pointless cost to
  protect data from a process that obviously has the right to read that
  data.

- Anything that accesses guest or user memory via the direct map
  instead of the user address space will hit an asi_exit().

- Pages being zeroed in the page allocator.

Most of these issues existed in v1 too, but now that ASI sandboxes
userspace processes, the page-cache issue becomes very significant. For
FIO 4k read (I suppose this workload is maximally sensitive to this
issue) I saw a 70% degradation in throughput, with a Sapphire Rapids
machine hard-coded to perform IBPB and RSB-stuffing on asi_exit(). Given
a result like that I haven't gone into more detailed analysis. Note also
that I ran with an unrealistic mitigation policy; the results would be
quite different with platform-appropriate flushes, although they would
presumably lead to the same conclusion.

There are some interesting discussions to be had about tackling that
problem (e.g.
reintroducing "local-nonsensitivity" from Junaid's 2022 ASI
implementation [2], or creating ephemeral CPU-local mappings), but for
this RFC I prefer to focus on deciding if the overall framework makes
sense.

.:: Next steps

Aside from lack of userspace support, all the other issues listed in
RFCv1 remain. I'll also need a proof-of-concept solution for the
page-cache issue before we can credibly claim to be reaching a [PATCH],
but before that I want to develop a more complete page_alloc
integration. I plan to propose a topic about that at LSF/MM/BPF.

Anyway, despite the further research needed on my side I think there's
still useful stuff to discuss here. For example:

- Does the "tainting" model make intuitive sense? Is there a simpler
  way to achieve something similar?

- The taints offer a model for different parts of the kernel to
  communicate with each other about what mitigations they've taken care
  of. For example, KVM could clear ASI taints if its existing
  conditional-L1D-flush logic fires. Does it make sense to take
  advantage of this? (I think yes). How does this influence the design
  of the bugs.c kernel arguments?

- Suggestions on how to map file pages into processes that can read
  them, while minimizing TLB management pain.

Finally, a more extensive branch can be found at [3]. It has some tests
and some of the lower-hanging fruit for optimising performance of KVM
guests.

[0] RFC v1: https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a40d8@xxxxxxxxxx/
[1] LPC session: https://lpc.events/event/18/contributions/1761/
[2] Junaid’s RFC: https://lore.kernel.org/all/20220223052223.1202152-1-junaids@xxxxxxxxxx/
[3] GitHub branch: https://github.com/googleprodkernel/linux-kvm/tree/asi-rfcv2-preview

Signed-off-by: Brendan Jackman <jackmanb@xxxxxxxxxx>

Ingo Molnar <mingo@xxxxxxxxxx>,
Borislav Petkov <bp@xxxxxxxxx>,
Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>,
"H. Peter Anvin" <hpa@xxxxxxxxx>,
Andy Lutomirski <luto@xxxxxxxxxx>,
Peter Zijlstra <peterz@xxxxxxxxxxxxx>,
Sean Christopherson <seanjc@xxxxxxxxxx>,
Paolo Bonzini <pbonzini@xxxxxxxxxx>,
Alexandre Chartre <alexandre.chartre@xxxxxxxxxx>,
Liran Alon <liran.alon@xxxxxxxxxx>,
Jan Setje-Eilers <jan.setjeeilers@xxxxxxxxxx>,
Catalin Marinas <catalin.marinas@xxxxxxx>,
Will Deacon <will@xxxxxxxxxx>,
Mark Rutland <mark.rutland@xxxxxxx>,
Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>,
Mel Gorman <mgorman@xxxxxxx>,
Lorenzo Stoakes <lstoakes@xxxxxxxxx>,
David Hildenbrand <david@xxxxxxxxxx>,
Vlastimil Babka <vbabka@xxxxxxx>,
Michal Hocko <mhocko@xxxxxxxxxx>,
Khalid Aziz <khalid.aziz@xxxxxxxxxx>,
Juri Lelli <juri.lelli@xxxxxxxxxx>,
Vincent Guittot <vincent.guittot@xxxxxxxxxx>,
Dietmar Eggemann <dietmar.eggemann@xxxxxxx>,
Steven Rostedt <rostedt@xxxxxxxxxxx>,
Valentin Schneider <vschneid@xxxxxxxxxx>,
Paul Turner <pjt@xxxxxxxxxx>,
Reiji Watanabe <reijiw@xxxxxxxxxx>,
Junaid Shahid <junaids@xxxxxxxxxx>,
Ofir Weisse <oweisse@xxxxxxxxxx>,
Yosry Ahmed <yosryahmed@xxxxxxxxxx>,
Patrick Bellasi <derkling@xxxxxxxxxx>,
KP Singh <kpsingh@xxxxxxxxxx>,
Alexandra Sandulescu <aesa@xxxxxxxxxx>,
Matteo Rizzo <matteorizzo@xxxxxxxxxx>,
Jann Horn <jannh@xxxxxxxxxx>
kvm@xxxxxxxxxxxxxxx,
Brendan Jackman <jackmanb@xxxxxxxxxx>,
Dennis Zhou <dennis@xxxxxxxxxx>

---
Changes in v2:
- Added support for sandboxing userspace processes.
- Link to v1: https://lore.kernel.org/r/20240712-asi-rfc-24-v1-0-144b319a40d8@xxxxxxxxxx

---
Brendan Jackman (21):
      mm: asi: Make some utility functions noinstr compatible
      x86: Create CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
      mm: asi: Introduce ASI core API
      mm: asi: Add infrastructure for boot-time enablement
      mm: asi: ASI support in interrupts/exceptions
      mm: asi: Avoid warning from NMI userspace accesses in ASI context
      mm: Add __PAGEFLAG_FALSE
      mm: asi: Map non-user buddy allocations as nonsensitive
      [TEMP WORKAROUND] mm: asi: Workaround missing partial-unmap support
      mm: asi: Map kernel text and static data as nonsensitive
      mm: asi: Map vmalloc/vmap data as nonsensitive
      mm: asi: Stabilize CR3 in switch_mm_irqs_off()
      mm: asi: Make TLB flushing correct under ASI
      KVM: x86: asi: Restricted address space for VM execution
      mm: asi: exit ASI before accessing CR3 from C code where appropriate
      mm: asi: Add infrastructure for mapping userspace addresses
      mm: asi: Restricted execution fore bare-metal processes
      x86: Create library for flushing L1D for L1TF
      mm: asi: Add some mitigations on address space transitions
      x86/pti: Disable PTI when ASI is on
      mm: asi: Stop ignoring asi=on cmdline flag

Junaid Shahid (4):
      mm: asi: Make __get_current_cr3_fast() ASI-aware
      mm: asi: ASI page table allocation functions
      mm: asi: Functions to map/unmap a memory range into ASI page tables
      mm: asi: Add basic infrastructure for global non-sensitive mappings

Ofir Weisse (1):
      mm: asi: asi_exit() on PF, skip handling if address is accessible

Reiji Watanabe (1):
      mm: asi: Map dynamic percpu memory as nonsensitive

Yosry Ahmed (2):
      mm: asi: Use separate PCIDs for restricted address spaces
      mm: asi: exit ASI before suspend-like operations

 arch/alpha/include/asm/Kbuild            |    1 +
 arch/arc/include/asm/Kbuild              |    1 +
 arch/arm/include/asm/Kbuild              |    1 +
 arch/arm64/include/asm/Kbuild            |    1 +
 arch/csky/include/asm/Kbuild             |    1 +
 arch/hexagon/include/asm/Kbuild          |    1 +
 arch/loongarch/include/asm/Kbuild        |    3 +
 arch/m68k/include/asm/Kbuild             |    1 +
 arch/microblaze/include/asm/Kbuild       |    1 +
 arch/mips/include/asm/Kbuild             |    1 +
 arch/nios2/include/asm/Kbuild            |    1 +
 arch/openrisc/include/asm/Kbuild         |    1 +
 arch/parisc/include/asm/Kbuild           |    1 +
 arch/powerpc/include/asm/Kbuild          |    1 +
 arch/riscv/include/asm/Kbuild            |    1 +
 arch/s390/include/asm/Kbuild             |    1 +
 arch/sh/include/asm/Kbuild               |    1 +
 arch/sparc/include/asm/Kbuild            |    1 +
 arch/um/include/asm/Kbuild               |    2 +-
 arch/x86/Kconfig                         |   27 +
 arch/x86/boot/compressed/ident_map_64.c  |   10 +
 arch/x86/boot/compressed/pgtable_64.c    |   11 +
 arch/x86/include/asm/asi.h               |  306 +++++++++
 arch/x86/include/asm/cpufeatures.h       |    1 +
 arch/x86/include/asm/disabled-features.h |    8 +-
 arch/x86/include/asm/idtentry.h          |   50 +-
 arch/x86/include/asm/kvm_host.h          |    3 +
 arch/x86/include/asm/l1tf.h              |   11 +
 arch/x86/include/asm/nospec-branch.h     |    2 +
 arch/x86/include/asm/pgalloc.h           |    6 +
 arch/x86/include/asm/pgtable_64.h        |    4 +
 arch/x86/include/asm/processor-flags.h   |   24 +
 arch/x86/include/asm/processor.h         |   20 +-
 arch/x86/include/asm/pti.h               |    6 +-
 arch/x86/include/asm/special_insns.h     |   45 +-
 arch/x86/include/asm/tlbflush.h          |    6 +
 arch/x86/kernel/process.c                |    2 +
 arch/x86/kernel/process_32.c             |    2 +-
 arch/x86/kernel/process_64.c             |    2 +-
 arch/x86/kernel/traps.c                  |   22 +
 arch/x86/kvm/Kconfig                     |    1 +
 arch/x86/kvm/svm/svm.c                   |    2 +
 arch/x86/kvm/vmx/nested.c                |    6 +
 arch/x86/kvm/vmx/vmx.c                   |  113 ++--
 arch/x86/kvm/x86.c                       |   81 ++-
 arch/x86/lib/Makefile                    |    1 +
 arch/x86/lib/l1tf.c                      |   96 +++
 arch/x86/lib/retpoline.S                 |   10 +
 arch/x86/mm/Makefile                     |    1 +
 arch/x86/mm/asi.c                        | 1039 ++++++++++++++++++++++++++++++
 arch/x86/mm/fault.c                      |  124 +++-
 arch/x86/mm/init.c                       |    7 +-
 arch/x86/mm/init_64.c                    |   25 +-
 arch/x86/mm/mm_internal.h                |    3 +
 arch/x86/mm/pti.c                        |   14 +-
 arch/x86/mm/tlb.c                        |  167 ++++-
 arch/x86/virt/svm/sev.c                  |    2 +-
 arch/xtensa/include/asm/Kbuild           |    1 +
 drivers/firmware/efi/libstub/x86-5lvl.c  |    2 +-
 include/asm-generic/asi.h                |  113 ++++
 include/asm-generic/vmlinux.lds.h        |   11 +
 include/linux/entry-common.h             |   11 +
 include/linux/gfp.h                      |    5 +
 include/linux/gfp_types.h                |   15 +-
 include/linux/mm_types.h                 |    7 +
 include/linux/page-flags.h               |   18 +
 include/linux/pgtable.h                  |    3 +
 include/trace/events/mmflags.h           |   12 +-
 init/main.c                              |    2 +
 kernel/entry/common.c                    |    1 +
 kernel/fork.c                            |    5 +
 kernel/sched/core.c                      |    9 +
 mm/init-mm.c                             |    4 +
 mm/internal.h                            |    2 +
 mm/mm_init.c                             |    1 +
 mm/page_alloc.c                          |  160 ++++-
 mm/percpu-vm.c                           |   50 +-
 mm/percpu.c                              |    4 +-
 mm/vmalloc.c                             |   53 +-
 tools/perf/builtin-kmem.c                |    1 +
 80 files changed, 2582 insertions(+), 190 deletions(-)
---
base-commit: ebd6ea9c6976c64ed5af3e6dce672616447e8e62
change-id: 20241115-asi-rfc-v2-5d9bbb441186

Best regards,
--
Brendan Jackman <jackmanb@xxxxxxxxxx>