[LSF/MM/BPF TOPIC] Page allocation for ASI

Brendan Jackman <jackmanb@xxxxxxxxxx> · Wed, 29 Jan 2025 12:40:33 +0000

At last year’s LSF/MM/BPF I presented Address Space Isolation (ASI) [0]. ASI is
a mitigation for a broad class of CPU vulnerabilities that works by creating a
second “restricted” kernel address space which has “sensitive” data unmapped. If
you’re unfamiliar with ASI, the first 10-15 minutes of that talk provide a broad
overview of the whole system. The v1 of my RFC [2] also has some explanatory
discussion in the cover letter.

Last year my talk was pretty high-level, taking the temperature of the MM
community about how to integrate this into the broader kernel and whether there
are any major roadblocks.

Since then, I’ve posted a new RFC [1] and Google’s internal implementation has
continued to expand its footprint in production; it’s now a cornerstone of our
CPU security strategy. Nonetheless, as noted in the RFCv2 cover letter, there
are a few hurdles to overcome, at least in a proof-of-concept, before I’ll be
making actual requests to merge ASI upstream.

The one I’d like to talk about at this session is how to best integrate ASI into
the page allocator. “Sensitivity” of memory in ASI is currently all decided at
the allocation site. This means that when allocating pages we need to alter the
pagetables for the restricted address space. This is a little tricky to do from
within the page allocator:

1. In the most general case, adding pages to the restricted address space requires
   allocating pagetables. Allocating while you allocate requires some thought to
   avoid spaghetti code/deadlock risk.

2. Removing them requires a TLB flush, which can’t be done from all
   page-freeing/allocating contexts.

In the RFCs, we’ve simply kept all free pages unmapped from the restricted
address space. The allocator itself is largely unchanged; at the very end of
allocation we map pages (if appropriate), allocating pagetables via totally
separate allocation calls. When ASI-mapped pages are freed, they go onto a queue
that is then freed asynchronously from a context that’s able to batch up the TLB
flushes before making them available for re-allocation. Reclaim is then made
aware of this asynchronous process so that __GFP_DIRECT_RECLAIM allocations can
block on it where necessary.
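
As an illustration only, the flow described above can be modelled in a few
lines of userspace C. This is a toy sketch, not code from the RFCs: every name
in it (toy_page, toy_alloc, toy_free, toy_drain, toy_alloc_blocking) is
hypothetical, a counter stands in for the TLB shootdown, and the "blocking"
path simply runs the drain inline rather than waiting on a worker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 16	/* size of the toy "physical memory" */

/* Toy per-page state. All names here are hypothetical, not from the RFC. */
struct toy_page {
	bool free;		/* available for allocation */
	bool asi_mapped;	/* present in the restricted address space */
	bool queued;		/* freed, but still waiting for unmap + flush */
};

static struct toy_page pages[NR_PAGES];
static size_t pending[NR_PAGES];	/* indices of queued pages */
static size_t nr_pending;
static unsigned long nr_flushes;	/* simulated TLB shootdowns */

void toy_init(void)
{
	for (size_t i = 0; i < NR_PAGES; i++)
		pages[i] = (struct toy_page){ .free = true };
	nr_pending = 0;
	nr_flushes = 0;
}

/* Allocate one page; map it at the very end if it is nonsensitive. */
long toy_alloc(bool sensitive)
{
	for (size_t i = 0; i < NR_PAGES; i++) {
		if (!pages[i].free)
			continue;
		/* Invariant of this scheme: free pages are never mapped. */
		assert(!pages[i].asi_mapped);
		pages[i].free = false;
		pages[i].asi_mapped = !sensitive;
		return (long)i;
	}
	return -1;
}

/* Free a page. Mapped pages cannot be reused until they are flushed. */
void toy_free(size_t idx)
{
	assert(idx < NR_PAGES && !pages[idx].free && !pages[idx].queued);
	if (pages[idx].asi_mapped) {
		pages[idx].queued = true;
		pending[nr_pending++] = idx;
		return;
	}
	pages[idx].free = true;
}

/* Asynchronous worker body: unmap the whole batch, flush once, release. */
size_t toy_drain(void)
{
	size_t n = nr_pending;

	if (n == 0)
		return 0;
	for (size_t i = 0; i < n; i++)
		pages[pending[i]].asi_mapped = false;
	nr_flushes++;	/* one shootdown covers the whole batch */
	for (size_t i = 0; i < n; i++) {
		pages[pending[i]].queued = false;
		pages[pending[i]].free = true;
	}
	nr_pending = 0;
	return n;
}

/* Stand-in for an allocation that is allowed to wait on the drain. */
long toy_alloc_blocking(bool sensitive)
{
	long idx = toy_alloc(sensitive);

	if (idx < 0 && toy_drain() > 0)
		idx = toy_alloc(sensitive);
	return idx;
}
```

Even in this toy form, a nonsensitive page that is freed and immediately
wanted again still has to pass through toy_drain() first, which is the
behaviour the concerns below are about.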

Although we’ve been able to hammer this approach into a viable shape for the
Google workloads we’ve been concerned with so far, it’s not a general solution.
Some concrete reasons include:

a. It leads to pointless TLB shootdowns; there must be pathological cases where
   lots of pages get un-mapped only to get immediately re-allocated and mapped
   again.

b. The asynchronous worker creates CPU jitter.

c. It provides no ability to prioritise re-allocating pages with the same
   sensitivity as prior allocations. As well as TLB issues, this creates page
   zeroing costs, as pages that were formerly sensitive need to be zeroed before
   they can be mapped into the restricted address space.

d. This all creates unnecessary allocation latency and extra work to free pages.

At last year’s session I touched on the idea of instead using something akin to
migratetypes to track sensitivity (more accurately: presence in ASI’s restricted
pagetables) of free pages/pageblocks. The feedback on that idea was basically
“dunno, we would need more details”. I’m now working on a design based on this
approach and I’d like to use this session to go over such details. I don’t have
a prototype yet, but by March I hope to have shared some illustrative code.
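
To give a rough feel for what "something akin to migratetypes" could mean,
here is a toy C model. It is a sketch under stated assumptions, not the design
being proposed: the names (toy_sens, toy_freelists, toy_alloc_block,
toy_free_block) are invented for illustration, free blocks are only counted
rather than linked, and the TLB flush and zeroing are represented by counters.
The single idea it shows is that an allocation first looks at the free list
matching its own sensitivity and pays a conversion cost only on fallback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sensitivity classes, tracked per free block. */
enum toy_sens { TOY_NONSENSITIVE, TOY_SENSITIVE, TOY_NR_SENS };

struct toy_freelists {
	size_t nr_free[TOY_NR_SENS];	/* free blocks, by current mapping state */
	unsigned long nr_flushes;	/* conversions that needed a TLB flush */
	unsigned long nr_zeroed;	/* conversions that needed zeroing */
};

/*
 * Allocate one block of the wanted sensitivity. Returns the list the block
 * was taken from, or -1 if nothing is free.
 */
int toy_alloc_block(struct toy_freelists *fl, enum toy_sens want)
{
	enum toy_sens other = (want == TOY_SENSITIVE) ? TOY_NONSENSITIVE
						      : TOY_SENSITIVE;

	/* Fast path: a free block is already in the right state. */
	if (fl->nr_free[want] > 0) {
		fl->nr_free[want]--;
		return want;
	}
	if (fl->nr_free[other] == 0)
		return -1;

	/* Fallback: take a block from the other list and convert it. */
	fl->nr_free[other]--;
	if (want == TOY_SENSITIVE)
		fl->nr_flushes++;	/* unmap from the restricted address space */
	else
		fl->nr_zeroed++;	/* formerly sensitive: zero before mapping */
	return other;
}

/* Freeing keeps the block in its current state: no flush, no queue. */
void toy_free_block(struct toy_freelists *fl, enum toy_sens state)
{
	fl->nr_free[state]++;
}
```

In this model the free path does no unmapping at all; the cost moves to the
(hopefully rare) fallback allocation, which is where the prioritisation
questions below come from.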

Some questions I’m currently investigating and would like to discuss in detail
(hopefully with proposed answers by the time of the conference!):

- Can we totally avoid the need to allocate pagetables during allocation, by
  keeping ASI’s restricted copy of the physmap in-sync with the unrestricted one,
  different only in _PAGE_PRESENT?

- If not, what’s the best way to allocate while we allocate?

- When a TLB shootdown would let us satisfy an allocation that is getting into the
  deeper end of the slowpath, how is that prioritised and structured wrt. direct
  compaction/reclaim/other fallbacks etc?

- How do we maintain a balance of sensitivities among free pages, and what does
  that desired balance look like?

  - (Note: if no page-table-allocation is needed to map nonsensitive pages, the
    second half of this question goes away: since mapping is cheap but unmapping
    is expensive, we would mostly just want to minimize the number of free pages
    mapped into the restricted address space.)
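
The first question in the list above (keeping the restricted physmap in sync
with the unrestricted one, differing only in _PAGE_PRESENT) can be pictured
with a toy model in C. This is only a sketch under simplifying assumptions: a
flat array stands in for each page table, TOY_PAGE_PRESENT stands in for
_PAGE_PRESENT, and all function names are hypothetical. The point it shows is
that once both tables have the same shape, mapping and unmapping are single
bit flips that never need to allocate a pagetable page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_PRESENT ((uint64_t)1 << 0)	/* stand-in for _PAGE_PRESENT */
#define TOY_NR_PTES 8

typedef uint64_t toy_pte_t;

/* Flat arrays standing in for the leaf level of each physmap copy. */
static toy_pte_t unrestricted[TOY_NR_PTES];
static toy_pte_t restricted[TOY_NR_PTES];

/* Populate both copies up front; the restricted one starts non-present. */
void toy_physmap_init(void)
{
	for (size_t i = 0; i < TOY_NR_PTES; i++) {
		unrestricted[i] = ((uint64_t)i << 12) | TOY_PAGE_PRESENT;
		restricted[i] = unrestricted[i] & ~TOY_PAGE_PRESENT;
	}
}

/* Mapping is a bit flip: the entry already exists, nothing is allocated. */
void toy_asi_map(size_t i)
{
	assert(i < TOY_NR_PTES);
	restricted[i] |= TOY_PAGE_PRESENT;
}

/* Unmapping is also a bit flip; returns true if a TLB flush is now owed. */
bool toy_asi_unmap(size_t i)
{
	bool was_present;

	assert(i < TOY_NR_PTES);
	was_present = (restricted[i] & TOY_PAGE_PRESENT) != 0;
	restricted[i] &= ~TOY_PAGE_PRESENT;
	return was_present;
}

/* Check that the two copies differ in nothing but the present bit. */
bool toy_in_sync(void)
{
	for (size_t i = 0; i < TOY_NR_PTES; i++) {
		if ((unrestricted[i] ^ restricted[i]) & ~TOY_PAGE_PRESENT)
			return false;
	}
	return true;
}
```

What the model leaves out is exactly the hard part: keeping the two real
tables the same shape when the unrestricted physmap is split or merged at
different page sizes.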

[0] https://lwn.net/Articles/974390/
    https://www.youtube.com/watch?v=DxaN6X_fdlI
[1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a40d8@xxxxxxxxxx/