Address Space Isolation (ASI) is a technique to mitigate broad classes of CPU vulnerabilities. ASI logically separates memory into “sensitive” and “nonsensitive”: the former is memory that may contain secrets; the latter is memory that, though it might be owned by a privileged component, we don’t actually care about leaking. The core aim is to apply comprehensive mitigations protecting sensitive memory, while avoiding the cost of protecting nonsensitive memory. This is implemented by creating a second, “restricted” address space for the kernel, in which sensitive data is not mapped. The implementation contains two broad areas of functionality:

::Sensitivity tracking:: provides mechanisms to determine which data is sensitive and to keep the restricted address space’s page tables up to date. At present this is done by adding new allocator flags, which allocation sites use to annotate data whenever its sensitivity differs from the default.

The definition of “sensitive” memory isn’t a topic we’ve fully explored yet; it may well vary from one deployment to the next. The framework is therefore implemented so that any given allocation can carry an arbitrary sensitivity setting. What is “sensitive” is in reality of course contextual: user data is sensitive in the general sense, but we don’t really care if a user is able to leak _its own_ data via CPU bugs. In one implementation we divide “nonsensitive” data into “global” and “local” nonsensitive, where local-nonsensitive data is mapped only into the restricted address space of the entity (process/KVM guest) it belongs to. This adds quite a lot of complexity, so at present we’re working without local-nonsensitivity support; if we can achieve all the security coverage we want with acceptable performance, this will be a big maintainability win.

The biggest challenge we’ve faced so far in sensitivity tracking is that transitioning memory from nonsensitive to sensitive requires flushing the TLB.
Aside from the performance impact, this cannot be done with IRQs disabled. The simple workaround is to keep all free pages unmapped from the restricted address space (so that they can be allocated with any sensitivity without a TLB flush), and to process the freeing of nonsensitive pages (which requires a TLB flush under this simple scheme) via an asynchronous worker. This creates lots of unnecessary TLB flushes, but perhaps worse, it can create artificial OOM conditions as pages are stranded on the asynchronous worker’s queue.

::Sandboxing:: is the logic that switches between address spaces and executes the actual mitigations. Before running untrusted code, i.e. userspace processes and KVM guests, the kernel enters the restricted address space. If a later kernel entry accesses sensitive data - as detected by a page fault - it returns to the normal kernel address space. Each of these address space transitions involves a buffer flush: on exiting the restricted address space (that is, right before accessing sensitive data for the first time since running untrusted code) we flush branch prediction buffers that can be exploited through Spectre-like attacks. On entering the restricted address space (that is, right before running untrusted code for the first time since accessing sensitive data) we flush data buffers that can be exploited as side channels in Meltdown-like attacks. The “happy path” for ASI is getting back to the untrusted code without having accessed any secret, and thus incurring no buffer flushes. If the sensitive/nonsensitive distinction is well chosen, it should be possible to afford extremely defensive buffer flushes on address space transitions, since those transitions are rare.

Some interesting details of the sandboxing logic relate to interrupt handling: when an interrupt triggers a transition out of the restricted address space, we may need to return to it before exiting the interrupt.
A simple implementation could just unconditionally return to the original address space after servicing any interrupt, but that can also lead to unnecessary transitions. Thus ASI becomes something like a small state machine.

ASI has been proposed and discussed several times over the years, most recently by Junaid Shahid & co in [1] and [2]. Since then, the sophistication of CPU bug exploitation has advanced, and Google’s interest in ASI has continued to grow. We’re now working on deploying an internal implementation, to prove that this concept has real-world value. Our current implementation has undergone lots of testing and is now close to production-ready. We’d like to share our progress since the last RFC and discuss the challenges we’ve faced so far in getting this feature production-ready. Hopefully this will prompt interesting discussion to guide the next upstream posting.

Some areas that would be fruitful to discuss:

- Feedback on the overall design.
- How we’ve generalised ASI as a framework that goes beyond the KVM use case.
- How we’ve implemented sensitivity tracking as a “deny list” to ease initial deployment, and how to develop this into a longer-term solution. The policy defining which memory objects are sensitive/nonsensitive ought to be decoupled from the ASI framework, and ideally even from the code that allocates memory.
- If/how KPTI should be implemented in the ASI framework. We plan to add a Userspace-ASI class that would map all nonsensitive kernel memory into the restricted address space, but there may also be value in an ASI class that mirrors exactly how the current KPTI works.
- How we’ve solved the TLB flushing issues in sensitivity tracking, and how it could be done better.

[1] https://lore.kernel.org/all/20220223052223.1202152-1-junaids@xxxxxxxxxx/
[2] https://www.phoronix.com/news/Google-LPC-ASI-2022