Re: [LSF/MM/BPF TOPIC] Page allocation for ASI

Brendan Jackman <jackmanb@xxxxxxxxxx> · Wed, 29 Jan 2025 17:35:29 +0100

On Wed, 29 Jan 2025 at 13:40, Brendan Jackman <jackmanb@xxxxxxxxxx> wrote:
>
> At last year’s LSF/MM/BPF I presented Address Space Isolation (ASI) [0]. ASI is
> a mitigation for a broad class of CPU vulnerabilities that works by creating a
> second “restricted” kernel address space which has “sensitive” data unmapped. If
> you’re unfamiliar with ASI, the first 10-15 minutes of that talk provide a broad
> overview of the whole system. The v1 of my RFC [2] also has some explanatory
> discussion in the cover letter.
>
> Last year my talk was pretty high-level, taking the temperature of the MM
> community about how to integrate this into the broader kernel and whether there
> are any major roadblocks.
>
> Since then, I’ve posted a new RFC [1] and Google’s internal implementation has
> continued to expand its footprint in production - it’s now a cornerstone of our
> CPU security strategy. Nonetheless, as noted in the RFCv2 cover letter, there are
> a few hurdles to overcome, at least in a proof-of-concept, before I’ll be making
> actual requests to merge ASI upstream.
>
> The one I’d like to talk about at this session is how to best integrate ASI into
> the page allocator. “Sensitivity” of memory in ASI is currently all decided at
> the allocation site. This means when allocating pages we need to alter the
> pagetables for the restricted address space. This is a little tricky from the
> page allocator:
>
> 1. In the most general case, adding pages to the restricted address space requires
>    allocating pagetables. Allocating while you allocate requires some thought to
>    avoid spaghetti code/deadlock risk.
>
> 2. Removing them requires a TLB flush, which can’t be done from all
>    page-freeing/allocating contexts.
>
> In the RFCs, we’ve simply kept all free pages unmapped from the restricted
> address space. The allocator itself is largely unchanged; at the very end of
> allocation we map pages (if appropriate), allocating pagetables via totally
> separate allocation calls. When ASI-mapped pages are freed, they go onto a queue
> that is then freed asynchronously from a context that’s able to batch up the TLB
> flushes before making them available for re-allocation. Reclaim is then made
> aware of this asynchronous process so that __GFP_DIRECT_RECLAIM allocations can
> block on it where necessary.
>
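
For anyone who wants something concrete to poke at, the deferred-free
scheme described above can be sketched as a toy userspace model. This is
not code from the RFC: every name here (model_page, model_free_page,
model_drain_deferred, the flush counter) is made up for illustration, and
the "TLB flush" is just a counter increment.

```c
/* Userspace model only: none of these names exist in the kernel or the RFC. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 16

struct model_page {
	bool asi_mapped;	/* present in the restricted address space */
	bool on_free_list;	/* available for re-allocation */
	struct model_page *next;
};

static struct model_page model_pages[MODEL_NR_PAGES];
static struct model_page *model_deferred;	/* freed but still ASI-mapped */
static unsigned int model_tlb_flushes;

/* Free path: an ASI-mapped page is queued instead of being reused at once. */
static void model_free_page(struct model_page *page)
{
	if (!page->asi_mapped) {
		page->on_free_list = true;
		return;
	}
	page->next = model_deferred;
	model_deferred = page;
}

/* Worker: unmap every queued page, flush once, then release the batch. */
static unsigned int model_drain_deferred(void)
{
	struct model_page *batch = model_deferred;
	struct model_page *page;
	unsigned int released = 0;

	model_deferred = NULL;
	if (!batch)
		return 0;

	for (page = batch; page; page = page->next)
		page->asi_mapped = false;

	model_tlb_flushes++;	/* stands in for one batched shootdown */

	while (batch) {
		page = batch;
		batch = batch->next;
		page->next = NULL;
		page->on_free_list = true;
		released++;
	}
	return released;
}
```

The point of the model is only that N queued pages cost one flush rather
than N, and that nothing on the queue is reusable until the drain runs,
which is where the jitter and latency complaints below come from.
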
> Although we’ve been able to hammer this approach into a viable shape for the
> Google workloads we’ve been concerned with so far, it’s not a general solution.
> Some concrete reasons include:
>
> a. It leads to pointless TLB shootdowns; there must be pathological cases where
>    lots of pages get un-mapped only to get immediately re-allocated and mapped
>    again.
>
> b. The asynchronous worker creates CPU jitter.
>
> c. It provides no ability to prioritise re-allocating pages with the same
>    sensitivity as prior allocations. As well as TLB issues, this creates page
>    zeroing costs as pages that were formerly sensitive need to be zeroed before
>    they can be mapped into the restricted address space.
>
> d. This all creates unnecessary allocation latency and extra work to free pages.
>
> At last year’s session I touched on the idea of instead using something akin to
> migratetypes to track sensitivity (more accurately: presence in ASI’s restricted
> pagetables) of free pages/pageblocks. The feedback on that idea was basically
> “dunno, we would need more details”. I’m now working on a design based on this
> approach and I’d like to use this session to go over such details. I don’t have
> a prototype yet, but by March I hope to have shared some illustrative code.
>
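
As a rough picture of what "something akin to migratetypes" could mean,
here is a toy userspace model with one free count per sensitivity. It is
an assumption-laden sketch, not the design under discussion: the enum, the
counters and the fallback policy are all hypothetical. A "conversion"
stands for whatever extra work (zeroing, mapping, or unmap plus flush) is
needed when a page changes sensitivity.

```c
/* Userspace model only: names and policy are invented for illustration. */
#include <assert.h>

enum model_sens { MODEL_SENSITIVE, MODEL_NONSENSITIVE, MODEL_NR_SENS };

static unsigned int model_nr_free[MODEL_NR_SENS];
static unsigned int model_conversions;

/* Return a page to the list matching the sensitivity it last had. */
static void model_free_one(enum model_sens was)
{
	model_nr_free[was]++;
}

/* Prefer a free page of the wanted sensitivity; otherwise convert one. */
static int model_alloc(enum model_sens want)
{
	enum model_sens other = (want == MODEL_SENSITIVE) ?
				MODEL_NONSENSITIVE : MODEL_SENSITIVE;

	if (model_nr_free[want]) {
		model_nr_free[want]--;
		return 0;
	}
	if (model_nr_free[other]) {
		model_nr_free[other]--;
		model_conversions++;	/* zeroing and/or pagetable update */
		return 0;
	}
	return -1;	/* nothing free at all */
}
```

In this model the conversion counter only moves when the preferred list is
empty, which is the property the current queue-based approach lacks.
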
> Some questions I’m currently investigating that I’d like to discuss details of
> (hopefully, with proposed answers by the time of the conference!):
>
> - Can we totally avoid the need to allocate pagetables during allocation, by
>   keeping ASI’s restricted copy of the physmap in-sync with the unrestricted one,
>   different only in _PAGE_PRESENT?
>
> - If not, what’s the best way to allocate while we allocate?
>
> - When a TLB shootdown would let us satisfy an allocation that is getting into the
>   deeper end of the slowpath, how is that prioritised and structured wrt. direct
>   compact/reclaim/other fallbacks etc?
>
> - How do we maintain a balance of sensitivities among free pages, and what does
>   that desired balance look like?
>
>   - (Note: if no page-table-allocation is needed to map nonsensitive pages, the
>     second question goes away: since mapping is cheap but unmapping is
>     expensive, we would mostly just want to minimize the number of free pages
>     mapped into the restricted address space).
>
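
The first question in the list above (restricted physmap identical to the
unrestricted one except for the present bit) can be modelled in userspace
like this. Again a sketch under assumptions: the two arrays stand in for
leaf pagetable entries, MODEL_PAGE_PRESENT stands in for _PAGE_PRESENT,
and all function names are hypothetical.

```c
/* Userspace model only: flat arrays stand in for leaf pagetable entries. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_PRESENT	((uint64_t)1 << 0)	/* stand-in for _PAGE_PRESENT */
#define MODEL_NR_PTES		8

static uint64_t model_unrestricted[MODEL_NR_PTES];
static uint64_t model_restricted[MODEL_NR_PTES];

/* Build both copies up front so that later mapping never allocates. */
static void model_physmap_init(void)
{
	unsigned int i;

	for (i = 0; i < MODEL_NR_PTES; i++) {
		uint64_t pte = ((uint64_t)i << 12) | MODEL_PAGE_PRESENT;

		model_unrestricted[i] = pte;
		model_restricted[i] = pte & ~MODEL_PAGE_PRESENT;
	}
}

/* Mapping into the restricted copy is a single bit set. */
static void model_asi_map(unsigned int idx)
{
	model_restricted[idx] |= MODEL_PAGE_PRESENT;
}

/* Unmapping clears the bit; returns true if the caller owes a TLB flush. */
static bool model_asi_unmap(unsigned int idx)
{
	bool was_present = (model_restricted[idx] & MODEL_PAGE_PRESENT) != 0;

	model_restricted[idx] &= ~MODEL_PAGE_PRESENT;
	return was_present;
}

/* Invariant: the two copies differ in nothing but the present bit. */
static bool model_in_sync(void)
{
	unsigned int i;

	for (i = 0; i < MODEL_NR_PTES; i++) {
		if ((model_unrestricted[i] ^ model_restricted[i]) &
		    ~MODEL_PAGE_PRESENT)
			return false;
	}
	return true;
}
```

The asymmetry the note above relies on shows up directly: model_asi_map()
has no failure path and no allocation, while model_asi_unmap() is the only
operation that can leave a flush outstanding.
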
> [0] https://lwn.net/Articles/974390/
>     https://www.youtube.com/watch?v=DxaN6X_fdlI
> [1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@xxxxxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a40d8@xxxxxxxxxx/

Hmm, I did not CC anyone except the list. Adding some people in case
it prompts a discussion.