At last year’s LSF/MM/BPF I presented Address Space Isolation (ASI) [0]. ASI is a mitigation for a broad class of CPU vulnerabilities that works by creating a second “restricted” kernel address space which has “sensitive” data unmapped. If you’re unfamiliar with ASI, the first 10-15 minutes of that talk provide a broad overview of the whole system. The v1 of my RFC [2] also has some explanatory discussion in the cover letter.

Last year my talk was pretty high-level, taking the temperature of the MM community about how to integrate this into the broader kernel and whether there are any major roadblocks. Since then, I’ve posted a new RFC [1] and Google’s internal implementation has continued to expand its footprint in production - it’s now a cornerstone of our CPU security strategy. Nonetheless, as noted in the RFCv2 cover letter, there are a few hurdles to overcome, at least in a proof-of-concept, before I’ll be making actual requests to merge ASI upstream. The one I’d like to talk about at this session is how to best integrate ASI into the page allocator.

“Sensitivity” of memory in ASI is currently all decided at the allocation site. This means when allocating pages we need to alter the pagetables for the restricted address space. This is a little tricky from the page allocator:

1. In the most general case, adding pages to the restricted address space requires allocating pagetables. Allocating while you allocate requires some thought to avoid spaghetti code/deadlock risk.

2. Removing pages from it requires a TLB flush, which can’t be done from all page-freeing/allocating contexts.

In the RFCs, we’ve simply kept all free pages unmapped from the restricted address space. The allocator itself is largely unchanged; at the very end of allocation we map pages (if appropriate), allocating pagetables via totally separate allocation calls.

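
The RFC allocation path described above can be modeled in a few lines of userspace C. This is a toy sketch, not code from the RFCs: every name in it (toy_page, toy_core_alloc(), toy_alloc_page(), toy_alloc_pagetable()) is hypothetical, and it always takes the general case where mapping needs a pagetable allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8

/* Toy stand-in for struct page; every name here is hypothetical. */
struct toy_page {
	bool in_use;            /* currently allocated */
	bool mapped_restricted; /* present in the restricted address space */
};

static struct toy_page toy_pages[TOY_NR_PAGES];
static int toy_pagetable_allocs; /* counts the separate pagetable allocations */

/* Stand-in for the separate pagetable allocation that mapping may need. */
static bool toy_alloc_pagetable(void)
{
	toy_pagetable_allocs++;
	return true;
}

/* The largely unchanged core allocator: hand out any free page. */
static struct toy_page *toy_core_alloc(void)
{
	for (size_t i = 0; i < TOY_NR_PAGES; i++) {
		if (!toy_pages[i].in_use) {
			/* Assumed invariant: free pages are never mapped. */
			assert(!toy_pages[i].mapped_restricted);
			toy_pages[i].in_use = true;
			return &toy_pages[i];
		}
	}
	return NULL;
}

/* Allocate, then map into the restricted address space as the last step. */
static struct toy_page *toy_alloc_page(bool sensitive)
{
	struct toy_page *page = toy_core_alloc();

	if (!page || sensitive)
		return page; /* sensitive pages stay unmapped */

	if (!toy_alloc_pagetable()) {
		page->in_use = false; /* undo: the page could not be mapped */
		return NULL;
	}
	page->mapped_restricted = true;
	return page;
}
```

Calling toy_alloc_page(false) returns a page with mapped_restricted set and bumps toy_pagetable_allocs; toy_alloc_page(true) returns a page that stays unmapped. The core allocator never sees sensitivity in this model; mapping is bolted on after it returns.
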
When ASI-mapped pages are freed, they go onto a queue that is then freed asynchronously from a context that’s able to batch up the TLB flushes before making them available for re-allocation. Reclaim is then made aware of this asynchronous process so that __GFP_DIRECT_RECLAIM allocations can block on it where necessary.

Although we’ve been able to hammer this approach into a viable shape for the Google workloads we’ve been concerned with so far, it’s not a general solution. Some concrete reasons include:

a. It leads to pointless TLB shootdowns; there must be pathological cases where lots of pages get un-mapped only to get immediately re-allocated and mapped again.

b. The asynchronous worker creates CPU jitter.

c. It provides no ability to prioritise re-allocating pages with the same sensitivity as prior allocations. As well as TLB issues, this creates page zeroing costs, as pages that were formerly sensitive need to be zeroed before they can be mapped into the restricted address space.

d. This all creates unnecessary allocation latency and extra work to free pages.

At last year’s session I touched on the idea of instead using something akin to migratetypes to track sensitivity (more accurately: presence in ASI’s restricted pagetables) of free pages/pageblocks. The feedback on that idea was basically “dunno, we would need more details”. I’m now working on a design based on this approach and I’d like to use this session to go over such details. I don’t have a prototype yet, but by March I hope to have shared some illustrative code.

Some questions I’m currently investigating that I’d like to discuss details of (hopefully, with proposed answers by the time of the conference!):

- Can we totally avoid the need to allocate pagetables during allocation, by keeping ASI’s restricted copy of the physmap in-sync with the unrestricted one, different only in _PAGE_PRESENT?

- If not, what’s the best way to allocate while we allocate?

- When a TLB shootdown would let us satisfy an allocation that is getting into the deeper end of the slowpath, how is that prioritised and structured wrt. direct compact/reclaim/other fallbacks etc?

- How do we maintain a balance of sensitivities among free pages, and what does that desired balance look like?

  - (Note: if no page-table-allocation is needed to map nonsensitive pages, the second question goes away: since mapping is cheap but unmapping is expensive, we would mostly just want to minimize the number of free pages mapped into the restricted address space).

[0] https://lwn.net/Articles/974390/
    https://www.youtube.com/watch?v=DxaN6X_fdlI
[1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a40d8@xxxxxxxxxx/