Address Space Isolation (ASI) is a technique to mitigate broad classes of CPU vulnerabilities. ASI logically separates memory into “sensitive” and “nonsensitive”: the former is memory that may contain secrets; the latter is memory that, though it might be owned by a privileged component, we don’t actually care about leaking. The core aim is to apply comprehensive mitigations protecting sensitive memory, while avoiding the cost of protecting nonsensitive memory. This is implemented by creating a second, “restricted” address space for the kernel, in which sensitive data is not mapped. The implementation contains two broad areas of functionality:

::Sensitivity tracking:: provides mechanisms to determine which data is sensitive and to keep the restricted address space’s page tables up to date. At present this is done by adding new allocator flags, which allocation sites use to annotate data whenever its sensitivity differs from the default.

The definition of “sensitive” memory isn’t a topic we’ve fully explored yet; it may well vary from one deployment to the next. The framework is therefore implemented so that any given allocation can carry an arbitrary sensitivity setting. What is “sensitive” is in reality of course contextual: user data is sensitive in the general sense, but we don’t really care if a user is able to leak _its own_ data via CPU bugs. In one implementation we divide “nonsensitive” data into “global” and “local” nonsensitive, where local-nonsensitive data is mapped only into the restricted address space of the entity (process/KVM guest) it belongs to. This adds quite a lot of complexity, so at present we’re working without local-nonsensitivity support; if we can achieve all the security coverage we want with acceptable performance, this will be a big maintainability win.

The biggest challenge we’ve faced so far in sensitivity tracking is that transitioning memory from nonsensitive to sensitive requires flushing the TLB.
Aside from the performance impact, this cannot be done with IRQs disabled. The simple workaround is to keep all free pages unmapped from the restricted address space (so that they can be allocated with any sensitivity without a TLB flush), and to process the freeing of nonsensitive pages (which requires a TLB flush under this simple scheme) via an asynchronous worker. This creates lots of unnecessary TLB flushes, but perhaps worse, it can create artificial OOM conditions as pages are stranded on the asynchronous worker’s queue.

::Sandboxing:: is the logic that switches between address spaces and executes the actual mitigations. Before running untrusted code, i.e. userspace processes and KVM guests, the kernel enters the restricted address space. If a later kernel entry accesses sensitive data - as detected by a page fault - it returns to the normal kernel address space. Each of these address space transitions involves a buffer flush: on exiting the restricted address space (that is, right before accessing sensitive data for the first time since running untrusted code) we flush branch prediction buffers that can be exploited through Spectre-like attacks. On entering the restricted address space (that is, right before running untrusted code for the first time since accessing sensitive data) we flush data buffers that can be exploited as side channels in Meltdown-like attacks. The “happy path” for ASI is getting back to the untrusted code without having accessed any secret, and thus incurring no buffer flushes. If the sensitive/nonsensitive distinction is well chosen, it should be possible to afford extremely defensive buffer flushes on address space transitions, since those transitions are rare.

Some interesting details of the sandboxing logic relate to interrupt handling: when an interrupt triggers a transition out of the restricted address space, we may need to return to it before exiting the interrupt.
A simple implementation could just unconditionally return to the original address space after servicing any interrupt, but that can also lead to unnecessary transitions. Thus ASI becomes something like a small state machine.

ASI has been proposed and discussed several times over the years, most recently by Junaid Shahid & co in [1] and [2]. Since then, the sophistication of CPU bug exploitation has advanced, and Google’s interest in ASI has continued to grow. We’re now working on deploying an internal implementation, to prove that this concept has real-world value. Our current implementation has undergone lots of testing and is now close to production-ready. We’d like to share our progress since the last RFC and discuss the challenges we’ve faced so far in getting this feature production-ready. Hopefully this will prompt interesting discussion to guide the next upstream posting.

Some areas that would be fruitful to discuss:

- Feedback on the overall design.
- How we’ve generalised ASI as a framework that goes beyond the KVM use case.
- How we’ve implemented sensitivity tracking as a “deny list” to ease initial deployment, and how to develop this into a longer-term solution. The policy defining which memory objects are sensitive/nonsensitive ought to be decoupled from the ASI framework, and ideally even from the code that allocates memory.
- If/how KPTI should be implemented in the ASI framework. We plan to add a Userspace-ASI class that would map all nonsensitive kernel memory into the restricted address space, but there may also be value in an ASI class that mirrors exactly how the current KPTI works.
- How we’ve solved the TLB flushing issues in sensitivity tracking, and how it could be done better.

[1] https://lore.kernel.org/all/20220223052223.1202152-1-junaids@xxxxxxxxxx/
[2] https://www.phoronix.com/news/Google-LPC-ASI-2022