On Wed, 29 Jan 2025 at 13:40, Brendan Jackman <jackmanb@xxxxxxxxxx> wrote:
>
> At last year’s LSF/MM/BPF I presented Address Space Isolation (ASI) [0]. ASI is
> a mitigation for a broad class of CPU vulnerabilities that works by creating a
> second “restricted” kernel address space which has “sensitive” data unmapped. If
> you’re unfamiliar with ASI, the first 10-15 minutes of that talk provide a broad
> overview of the whole system. The v1 of my RFC [2] also has some explanatory
> discussion in the cover letter.
>
> Last year my talk was pretty high-level, taking the temperature of the MM
> community about how to integrate this into the broader kernel and whether there
> are any major roadblocks.
>
> Since then, I’ve posted a new RFC [1] and Google’s internal implementation has
> continued to expand its footprint in production - it’s now a cornerstone of our
> CPU security strategy. Nonetheless, as noted in the RFCv2 cover letter there are
> a few hurdles to overcome, at least in a proof-of-concept, before I’ll be making
> actual requests to merge ASI upstream.
>
> The one I’d like to talk about at this session is how to best integrate ASI into
> the page allocator. “Sensitivity” of memory in ASI is currently all decided at
> the allocation site. This means when allocating pages we need to alter the
> pagetables for the restricted address space. This is a little tricky from the
> page allocator:
>
> 1. In the most general case, adding pages to the restricted address space requires
>    allocating pagetables. Allocating while you allocate requires some thought to
>    avoid spaghetti code/deadlock risk.
>
> 2. Removing them requires a TLB flush, which can’t be done from all
>    page-freeing/allocating contexts.
>
> In the RFCs, we’ve simply kept all free pages unmapped from the restricted
> address space. The allocator itself is largely unchanged; at the very end of
> allocation we map pages (if appropriate), allocating pagetables via totally
> separate allocation calls.
> When ASI-mapped pages are freed, they go onto a queue
> that is then freed asynchronously from a context that’s able to batch up the TLB
> flushes before making them available for re-allocation. Reclaim is then made
> aware of this asynchronous process so that __GFP_DIRECT_RECLAIM allocations can
> block on it where necessary.
>
> Although we’ve been able to hammer this approach into a viable shape for the
> Google workloads we’ve been concerned with so far, it’s not a general solution.
> Some concrete reasons include:
>
> a. It leads to pointless TLB shootdowns; there must be pathological cases where
>    lots of pages get un-mapped only to get immediately re-allocated and mapped
>    again.
>
> b. The asynchronous worker creates CPU jitter.
>
> c. It provides no ability to prioritise re-allocating pages with the same
>    sensitivity as prior allocations. As well as TLB issues this creates page
>    zeroing costs as pages that were formerly sensitive need to be zeroed before
>    they can be mapped into the restricted address space.
>
> d. This all creates unnecessary allocation latency and extra work to free pages.
>
> At last year’s session I touched on the idea of instead using something akin to
> migratetypes to track sensitivity (more accurately: presence in ASI’s restricted
> pagetables) of free pages/pageblocks. The feedback on that idea was basically
> “dunno, we would need more details”. I’m now working on a design based on this
> approach and I’d like to use this session to go over such details. I don’t have
> a prototype yet, but by March I hope to have shared some illustrative code.
>
> Some questions I’m currently investigating that I’d like to discuss details of
> (hopefully, with proposed answers by the time of the conference!):
>
> - Can we totally avoid the need to allocate pagetables during allocation, by
>   keeping ASI’s restricted copy of the physmap in-sync with the unrestricted one,
>   different only in _PAGE_PRESENT?
>
> - If not, what’s the best way to allocate while we allocate?
>
> - When a TLB shootdown would let us satisfy an allocation that is getting into the
>   deeper end of the slowpath, how is that prioritised and structured wrt. direct
>   compact/reclaim/other fallbacks etc?
>
> - How do we maintain a balance of sensitivities among free pages, and what does
>   that desired balance look like?
>
>   - (Note: if no page-table-allocation is needed to map nonsensitive pages, the
>     second question goes away: since mapping is cheap but unmapping is
>     expensive, we would mostly just want to minimize the number of free pages
>     mapped into the restricted address space).
>
> [0] https://lwn.net/Articles/974390/
>     https://www.youtube.com/watch?v=DxaN6X_fdlI
> [1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@xxxxxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a40d8@xxxxxxxxxx/

Hmm, I did not CC anyone except the list. Adding some people in case it
prompts a discussion.
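
For anyone joining the thread without the RFC context, here is a rough userspace toy model of the deferred-free scheme described in the quoted text: ASI-mapped pages are queued on free, unmapped in a batch, covered by a single flush, and only then made available again. This is not code from either RFC. Every name here (toy_page, deferred_queue, toy_drain, BATCH_MAX and so on) is hypothetical, the "TLB flush" is just a counter, and the asynchronous worker is modelled as an explicit drain call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX 8 /* hypothetical batch size, chosen arbitrarily */

/* Toy page descriptor: tracks only what this model needs. */
struct toy_page {
	bool asi_mapped;   /* present in the restricted address space? */
	bool on_free_list; /* available for re-allocation? */
};

/* Pages waiting to be unmapped and flushed before they can be reused. */
struct deferred_queue {
	struct toy_page *pending[BATCH_MAX];
	size_t count;
	unsigned long tlb_flushes; /* number of simulated shootdowns */
};

/* Stand-in for a real TLB shootdown; it only bumps a counter. */
static void toy_tlb_flush(struct deferred_queue *q)
{
	q->tlb_flushes++;
}

/*
 * Drain the queue: unmap every pending page, issue ONE flush for the
 * whole batch, then release the pages. In the scheme described above
 * this would run from an asynchronous context (or be waited on by a
 * direct reclaimer); here the caller invokes it explicitly.
 */
static void toy_drain(struct deferred_queue *q)
{
	size_t i;

	if (q->count == 0)
		return;

	for (i = 0; i < q->count; i++)
		q->pending[i]->asi_mapped = false;

	toy_tlb_flush(q);

	for (i = 0; i < q->count; i++)
		q->pending[i]->on_free_list = true;

	q->count = 0;
}

/*
 * Free path: pages that were never mapped into the restricted address
 * space are released immediately; mapped pages are deferred so that
 * the flush cost is shared across the batch.
 */
static void toy_free_page(struct deferred_queue *q, struct toy_page *page)
{
	if (!page->asi_mapped) {
		page->on_free_list = true;
		return;
	}

	assert(q->count < BATCH_MAX);
	q->pending[q->count++] = page;

	/* Queue is full: drain now rather than drop the page. */
	if (q->count == BATCH_MAX)
		toy_drain(q);
}
```

The model also makes problem (a) from the quoted list easy to see: a page that is freed and then immediately wanted again with the same sensitivity still pays for an unmap, a flush and a remap, because the queue has no notion of what the next allocation will ask for.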