[LSF/MM/BPF TOPIC] ASI's page cache problem

Brendan Jackman <jackmanb@xxxxxxxxxx> · Wed, 29 Jan 2025 14:43:20 +0000

This is a "lower priority" topic that I would like to discuss if there are
unused slots, but it shouldn't be scheduled at the expense of other sessions as I
have not started properly researching it and I don't expect to be very well
prepared by the time of the conference. You'll get an idea of this lack of
research from the vague hand-wavy ideas discussed at the bottom of this mail.

My main topic proposal is [0], there's more context about ASI there.

In the RFCv2 [1] for ASI, I added the capability to halt attacks from malicious
bare-metal processes. This is the major missing piece that's required before ASI
offers a way for maintainers to stop developing bespoke per-CPU-vuln mitigations.
As I discussed in the cover letter though, it exposes us to a major performance
issue: we have no way to map file pages into the restricted address space.

This means that whenever a process accesses a file via read(), an ASI page fault
is triggered when the kernel accesses the page in the direct map. This is very
expensive: I measured a 70% degradation on a 4k fio randread benchmark. And it's
totally pointless as the process is about to get architectural access to the
data we are "protecting" anyway.

The basic issue at play here is that ASI decides on "sensitivity" (whether to
map into the restricted address space) at allocation time, but whether ASI is
required to protect file data from a given process is not generally known at
the time when the physical pages that will hold it are allocated.

The most obvious direction to search is a solution that maps these file pages at
some later time, and ensures they are unmapped before the process loses logical
access to the data. For this session I'd like to discuss ways of doing that
without creating intolerable TLB management pain.

The physmap is global, but "sensitivity" of file data is obviously relative to
the process that wants to access it. Thus this addition to the restricted
address space has to be process-local. I haven't properly explored it, but I
suspect mixing global and local elements together in the ASI physmap is not
practical.

(Note: Junaid's earlier ASI RFC [2] included support for process-local
sensitivity but still required deciding on the sensitivity at allocation time).

So two ideas come to mind:

- Create a new process-local vmalloc-like area, where file pages can be mapped
  as the process gains access to the underlying file.

  I don't yet have a mental picture of whether this is possible without creating
  overheads that grow linearly with the number of processes that can read a
  file, or how bad such overheads would be.

- Create a new CPU-local region of the kernel address space. When reading file
  pages, ephemerally map them into this region with preemption off, and tear
  down these mappings before re-enabling preemption. Since they are CPU-local,
  that teardown requires no cross-thread communication and "should be pretty
  fast".

  At best, this means incurring a TLB miss on every file access; I don't know
  how bad that would be. I also don't know how costly it would be to create
  per-CPU virtual memory regions (meaning the PGD must be per-CPU).
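
To make the second idea a bit more concrete, here is a minimal userspace toy
model of it. This is not kernel code and not part of any ASI series: every name
in it (cpu_window, window_map, window_unmap, model_read_page) is hypothetical,
"preemption" is just a flag, and a "TLB flush" is just a counter. It only
illustrates the intended ordering: disable preemption, map into a CPU-local
slot, copy, unmap, flush locally, re-enable preemption.

```c
/* Userspace toy model only. All names here are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS   4
#define NR_SLOTS  2
#define PAGE_SIZE 4096

/* One small window of mapping slots per CPU, standing in for a CPU-local
 * region of the kernel address space. */
struct cpu_window {
	const unsigned char *slot[NR_SLOTS]; /* NULL means unmapped */
	bool preempt_disabled;               /* stand-in for preempt_disable() */
	unsigned long local_flushes;         /* invalidations on this CPU only */
};

static struct cpu_window windows[NR_CPUS];

/* Map a file page into a free slot of this CPU's window; -1 if full. */
static int window_map(int cpu, const unsigned char *page)
{
	struct cpu_window *w = &windows[cpu];

	assert(w->preempt_disabled); /* mappings must not outlive this section */
	for (int i = 0; i < NR_SLOTS; i++) {
		if (!w->slot[i]) {
			w->slot[i] = page;
			return i;
		}
	}
	return -1;
}

/* Tear down the mapping. Only this CPU ever saw it, so only this CPU's
 * counter is bumped: no cross-CPU communication is modelled. */
static void window_unmap(int cpu, int idx)
{
	struct cpu_window *w = &windows[cpu];

	assert(w->preempt_disabled && w->slot[idx]);
	w->slot[idx] = NULL;
	w->local_flushes++;
}

/* Model of the read() path for one page: the data is only reachable through
 * the ephemeral CPU-local mapping while "preemption" is off. */
static size_t model_read_page(int cpu, const unsigned char *page,
			      unsigned char *dst, size_t len)
{
	struct cpu_window *w = &windows[cpu];
	int idx;

	if (len > PAGE_SIZE)
		len = PAGE_SIZE;

	w->preempt_disabled = true;
	idx = window_map(cpu, page);
	if (idx < 0) {
		w->preempt_disabled = false;
		return 0;
	}
	memcpy(dst, w->slot[idx], len);
	window_unmap(cpu, idx);
	w->preempt_disabled = false;
	return len;
}
```

The model ignores everything that makes the real problem hard (page table
allocation for a per-CPU PGD, actual TLB behaviour, faults during the copy). It
is only meant to show that each read pays for one map and one local unmap, and
that nothing is left mapped once preemption is re-enabled.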

And I'd like to discuss:

- Reasons people might see why these ideas are total non-starters.

- Totally different ideas for solving the page-cache issue.

- Other problems that might overlap with this one, and benefit from some new
  shared virtual memory facility.

[0] https://lore.kernel.org/all/20250129124034.2612562-1-jackmanb@xxxxxxxxxx/

[1] https://lore.kernel.org/linux-mm/20250110-asi-rfc-v2-v2-0-8419288bc805@xxxxxxxxxx/

[2] https://lore.kernel.org/all/20220223052223.1202152-1-junaids@xxxxxxxxxx/