Re: [LSF/MM/BPF TOPIC] Address Space Isolation

Petr Tesařík <petr@xxxxxxxxxxx> · Tue, 12 Mar 2024 15:48:28 +0100

On Thu, 29 Feb 2024 10:57:21 +0100
Brendan Jackman <jackmanb@xxxxxxxxxx> wrote:

> Address Space Isolation (ASI) is a technique to mitigate broad classes
> of CPU vulnerabilities.
> 
> ASI logically separates memory into “sensitive” and “nonsensitive”:
> the former is memory that may contain secrets and the latter is memory
> that, though it might be owned by a privileged component, we don’t
> actually care about leaking. The core aim is to execute comprehensive
> mitigations for protecting sensitive memory, while avoiding the cost
> of protecting nonsensitive memory. This is implemented by creating
> another “restricted” address space for the kernel, in which sensitive
> data is not mapped.
> 
> The implementation contains two broad areas of functionality:
> 
> ::Sensitivity tracking:: provides mechanisms to determine which data
> is sensitive and keep the restricted address space page tables
> up-to-date. At present this is done by adding new allocator flags
> which allocation sites use to annotate data whenever its sensitivity
> differs from the default.
> 
> The definition of “sensitive” memory isn’t a topic we’ve fully
> explored yet - it’s possible that this will vary from any given
> deployment to the next. The framework is implemented so that any given
> allocation can have an arbitrary sensitivity setting.
> 
> What is “sensitive” is in reality of course contextual. User data is
> sensitive in the general sense, but we don’t really care if a user is
> able to leak _its own_ data via CPU bugs. In one implementation we
> divide “nonsensitive” data into “global” and “local” nonsensitive.
> Local-nonsensitive data is mapped into the restricted address space of
> the entity (process/KVM guest) that it belongs to. This adds quite a
> lot of complexity, so at present we’re working without
> local-nonsensitivity support - if we can achieve all the security
> coverage we want with acceptable performance then this will be a big
> maintainability win.
> 
> The biggest challenge we’ve faced so far in sensitivity tracking is
> that transitioning memory from nonsensitive to sensitive requires
> flushing the TLB. Aside from the performance impact, this cannot be
> done with IRQs disabled. The simple workaround for this is to keep all
> free pages unmapped from the restricted address space (so that they
> can be allocated with any sensitivity without a TLB flush), and
> process freeing of nonsensitive pages (requiring a TLB flush under
> this simple scheme) via an asynchronous worker. This creates lots of
> unnecessary TLB flushes, but perhaps worse it can create artificial
> OOM conditions as pages are stranded on the asynchronous worker’s
> queue.
> 
> ::Sandboxing:: is the logic that switches between address spaces and
> executes actual mitigations. Before running untrusted code, i.e.
> userspace processes and KVM guests, the kernel enters the restricted
> address space. If a later kernel entry accesses sensitive data - as
> detected by a page fault - it returns to the normal kernel address
> space. Each of these address space transitions involves a buffer
> flush: on exiting the restricted address space (that is, right before
> accessing sensitive data for the first time since running untrusted
> code) we flush branch prediction buffers that can be exploited through
> Spectre-like attacks. On entering the restricted address space (that
> is, right before running untrusted code for the first time since
> accessing sensitive data) we flush data buffers that can be exploited
> as side channels with Meltdown-like attacks. The “happy path” for ASI
> is getting back to the untrusted code without accessing any secret,
> and thus incurring no buffer flushes. If the sensitive/nonsensitive
> distinction is well-chosen, it should be possible to afford extremely
> defensive buffer-flushes on address space transitions, since those
> transitions are rare.
> 
> Some interesting details of sandboxing logic relate to interrupt
> handling: when an interrupt triggers a transition out of the
> restricted address space, we may need to return to it before exiting
> the interrupt. A simple implementation could just unconditionally
> return to the original address space after servicing any interrupt,
> but that can also lead to unnecessary transitions. Thus ASI becomes
> something like a small state machine.
> 
> ASI has been proposed and discussed several times over the years, most
> recently by Junaid Shahid & co in [1] and [2]. Since then, the
> sophistication of CPU bug exploitation has advanced and Google’s interest
> in ASI has continued to grow. We’re now working on deploying an
> internal implementation, to prove that this concept has real-world
> value. Our current implementation has undergone lots of testing and is
> now close to production-ready.
> 
> We’d like to share our progress since the last RFC and discuss the
> challenges we’ve faced so far in getting this feature
> production-ready. Hopefully this will prompt interesting discussion to
> guide the next upstream posting.  Some areas that would be fruitful to
> discuss:
> 
> - Feedback on the overall design.
> 
> - How we’ve generalised ASI as a framework that goes beyond the KVM use case
> 
> - How we’ve implemented sensitivity tracking as a “deny list” to ease
> initial deployment, and how to develop this into a longer-term
> solution. The policy defining what memory objects are
> sensitive/non-sensitive ought to be decoupled from the ASI framework
> and ideally even from the code that allocates memory
> 
> - If/how KPTI should be implemented in the ASI framework. We plan to add
> a Userspace-ASI class that would map all non-sensitive kernel memory
> in the restricted address space, but perhaps there may be value in
> also having an ASI class that mirrors exactly how the current KPTI
> works.
> 
> - How we’ve solved the TLB flushing issues in sensitivity tracking, and
> how it could be done better.

Hello and welcome! I ran into a similar challenge with SandBox Mode. My
solution was to run sandbox code with CPL=3 (on x86) and control page
access with the U/S PTE bit rather than the P bit, which allowed me to
implement lazy TLB invalidation. The x86 folks didn't like the idea...
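
To make the U/S versus P distinction concrete, here is a minimal
user-space sketch in C. The bit positions (P is bit 0, U/S is bit 2) are
the architectural x86 page-table layout. The helper name
pte_allows_access() is made up for illustration and is not taken from
the SandBox Mode patches. The model deliberately ignores R/W, NX, SMAP,
protection keys and the upper-level page-table entries.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 PTE flag bits. */
#define PTE_P   (UINT64_C(1) << 0)  /* Present */
#define PTE_US  (UINT64_C(1) << 2)  /* User/Supervisor: set = user-accessible */

/*
 * Hypothetical helper (an assumption for this sketch, not a kernel API):
 * would a data access at privilege level `cpl` be permitted by the P and
 * U/S bits of this single entry?
 */
static bool pte_allows_access(uint64_t pte, int cpl)
{
    if (!(pte & PTE_P))
        return false;   /* not present: faults at any privilege level */
    if (cpl == 3 && !(pte & PTE_US))
        return false;   /* supervisor-only page: faults at CPL=3 */
    return true;
}
```

In this model a page with P=1 and U/S=0 stays reachable from kernel code
at CPL=0 but faults when touched at CPL=3, whereas clearing P makes it
unreachable for everyone.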

For the record, SandBox Mode was designed with confidentiality in mind,
although the initial patch series left out this part for simplicity. I
wonder if your objective is to protect kernel data from user space, or
if you have also considered decomposing the kernel into components that
are isolated from each other (and then we could potentially find
some synergies).
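
For reference, the address space transitions described in the quoted
sandboxing paragraph can be modelled roughly as below. This is only a
toy simulation of that description, not the actual ASI code: all names
(asi_sim, asi_sim_enter, asi_sim_touch_sensitive) are invented here, the
flushes are reduced to counters, and interrupt handling is left out.

```c
/* Which kernel address space is currently active. */
enum asi_space { ASI_UNRESTRICTED, ASI_RESTRICTED };

/* Toy model: the counters stand in for the real buffer flushes. */
struct asi_sim {
    enum asi_space space;
    unsigned int data_buffer_flushes;   /* done on entering restricted */
    unsigned int branch_buffer_flushes; /* done on exiting restricted */
};

/* Called right before running untrusted code (userspace or a KVM guest). */
static void asi_sim_enter(struct asi_sim *s)
{
    if (s->space == ASI_RESTRICTED)
        return;                 /* happy path: no secret was touched */
    s->data_buffer_flushes++;   /* Meltdown-like side channels */
    s->space = ASI_RESTRICTED;
}

/* Called when the kernel faults on sensitive (unmapped) data. */
static void asi_sim_touch_sensitive(struct asi_sim *s)
{
    if (s->space == ASI_UNRESTRICTED)
        return;                 /* already in the full address space */
    s->branch_buffer_flushes++; /* Spectre-like attacks */
    s->space = ASI_UNRESTRICTED;
}
```

A kernel entry that never calls asi_sim_touch_sensitive() costs no flush
at all in this model, which is the "happy path" from the description.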

Petr T