Adding new recipients so I'll quote the whole mail...

On Wed, 29 Jan 2025 at 15:43, Brendan Jackman <jackmanb@xxxxxxxxxx> wrote:
>
> This is a "lower priority" topic that I would like to discuss if there are
> unused slots, but it shouldn't be scheduled in favour of other sessions as I
> have not started properly researching it and I don't expect to be very well
> prepared by the time of the conference. You'll get an idea of this lack of
> research from the vague hand-wavy ideas discussed at the bottom of this mail.
>
> My main topic proposal is [0], there's more context about ASI there.
>
> In the RFCv2 [1] for ASI, I added the capability to halt attacks from
> malicious bare-metal processes. This is the major missing piece that's
> required before ASI offers a way for maintainers to stop developing bespoke
> per-CPU-vuln mitigations. As I discussed in the cover letter though, it
> exposes us to a major performance issue: we have no way to map file pages
> into the restricted address space.
>
> This means that whenever a process accesses a file via read(), an ASI page
> fault is triggered when the kernel accesses the page in the direct map. This
> is very expensive: I measured a 70% degradation on a 4k fio randread
> benchmark. And it's totally pointless, as the process is about to get
> architectural access to the data we are "protecting" anyway.
>
> The basic issue at play here is that ASI decides on "sensitivity" (whether to
> map into the restricted address space) at allocation time, but whether ASI is
> required to protect file data from a given process is not generally known at
> the time when the physical pages that will hold it are allocated.
>
> The most obvious direction to search is a solution that maps these file pages
> at some later time, and ensures they are unmapped before the process loses
> logical access to the data. For this session I'd like to discuss ways of
> doing that without creating intolerable TLB management pain.
>
> The physmap is global, but "sensitivity" of file data is obviously relative
> to the process that wants to access it. Thus this addition to the restricted
> address space has to be process-local. I haven't properly explored it, but I
> suspect mixing global and local elements together in the ASI physmap is not
> practical.
>
> (Note: Junaid's earlier ASI RFC [2] included support for process-local
> sensitivity but still required deciding on the sensitivity at allocation
> time.)
>
> So two ideas come to mind:
>
> - Create a new process-local vmalloc-like area, where file pages can be
>   mapped as the process gains access to the underlying file.
>
>   I don't yet have a mental picture of whether this is possible without
>   creating overheads that grow linearly with the number of processes that can
>   read a file, or how bad such overheads would be.

Feedback from Dave Hansen was that this sounds pretty hard. But it remains to
be explored.

> - Create a new CPU-local region of the kernel address space. When reading
>   file pages, ephemerally map them into this region with preemption off, and
>   tear down these mappings before re-enabling preemption. Since they are
>   CPU-local, that teardown requires no cross-thread communication and "should
>   be pretty fast".
>
>   At best, this means incurring a TLB miss on every file access; I don't know
>   how bad that would be. I also don't know how costly it would be to create
>   per-CPU virtual memory regions (meaning the PGD must be per-CPU).

Dave also made me aware that per-CPU PGDs have been discussed several times in
the past for various reasons, like getting rid of the GS percpu magic and
solving that with paging. He said that it's _probably_ a nonstarter.

Some other ideas that have come up in interesting conversations here:

- Yosry suggested that we use a DMA engine to copy the file data while
  avoiding touching it through the CPU's MMU.
  Reiji expressed some doubts about whether that's gonna be any less costly
  than an asi_exit(), but it could definitely be worth exploring.

- Reiji pointed out that for small reads we might consider just using a
  non-cacheable mapping. We generally believe it's fine for arbitrary
  non-cacheable mappings to be in the ASI restricted address space, so there
  might be an extremely simple solution here if the performance
  characteristics are right.

- Newer hardware has features that might massively alleviate the problems with
  tearing down ephemeral mappings:

  - (Something something protection keys... I haven't really thought this
    through.)

  - HW support for remote TLB invalidations might make things pretty fast. If
    we were willing to say "ASI is only fast on newer hardware" this could be
    an interesting line of attack. I've personally expressed some resistance
    to this idea and said stuff like "no, it has to be fast on Skylake", but
    maybe it doesn't, if we think we've already found and mitigated all the
    Skylake bugs.

- I am also now pondering whether the ephemeral mappings I discussed above
  would actually have to be CPU-local. That idea was about avoiding remote TLB
  flushes, but actually we only need to flush the TLB at the first point where
  it's possible the process has lost access to the file. In combination with
  invalidation flags and TLB generations, maybe this can mean "effectively
  never".

So overall, plenty more research is required, but there are a lot of useful
avenues to explore here. I'm feeling optimistic again.

> And I'd like to discuss:
>
> - Reasons people might see why these ideas are total non-starters.
>
> - Totally different ideas for solving the page-cache issue.
>
> - Other problems that might overlap with this one, and benefit from some new
>   shared virtual memory facility.
>
> [0] https://lore.kernel.org/all/20250129124034.2612562-1-jackmanb@xxxxxxxxxx/
>
> [1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@xxxxxxxxxx/
>
> [2] https://lore.kernel.org/all/20220223052223.1202152-1-junaids@xxxxxxxxxx/