Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment

Nadav Amit <nadav.amit@xxxxxxxxx> · Thu, 27 Oct 2022 13:15:22 -0700

On Oct 27, 2022, at 11:13 AM, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> Anybody willing to try to write up the rules (and have each rule
> document *why* it's a rule - not just "by fiat", but an actual "these
> are the rules and this is *why* they are the rules").
> 
> Because right now I think all of our rules are almost entirely just
> encoded in the code, with a couple of comments, and a few people who
> just remember why we do what we do.

I think it might be easier to come up with new rules than to articulate the
existing ones.

The approach I suggested before [1] is something like:

1. Make x86’s TLB-generation mechanism generic. Treat the existing
   TLB-generation as the “pending TLB-generation”.

2. For each mm, track a “completed TLB-generation”, which is updated
   whenever an actual flush takes place.

3. When you defer a TLB-flush, while holding the PTL:
  a. Increase the TLB-generation.
  b. Save the updated generation as the “table generation” in a new field
     in the page-table’s page-struct.

4. When you are about to rely on a PTE value that is read from a page-table,
   first check if a TLB flush is needed. The check is performed by comparing
   the “table generation” with the “completed generation”. If the “completed
   generation” is behind the “table generation”, a TLB flush is needed.

   [ You rely on the PTE value when you install new PTEs or change them ]
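
To make the four steps concrete, here is a minimal single-threaded user-space
model of the bookkeeping. It is only a sketch: every name in it (mm_model,
pt_page_model, defer_tlb_flush() and so on) is made up for illustration and is
not an existing kernel interface, the PTL is reduced to a boolean, and the
ordering issues of a real concurrent flush are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model types; none of these names exist in the kernel. */
struct mm_model {
	uint64_t pending_gen;   /* steps 1, 3a: bumped when a flush is deferred */
	uint64_t completed_gen; /* step 2: generation covered by the last real flush */
};

struct pt_page_model {
	uint64_t table_gen;     /* step 3b: generation of the last deferred flush */
	bool ptl_held;          /* stands in for the page-table lock */
};

/* Step 3: defer a TLB flush while holding the PTL. */
static void defer_tlb_flush(struct mm_model *mm, struct pt_page_model *pt)
{
	assert(pt->ptl_held);
	mm->pending_gen++;               /* 3a */
	pt->table_gen = mm->pending_gen; /* 3b */
}

/* Step 2: an actual flush completes everything that was pending. */
static void do_tlb_flush(struct mm_model *mm)
{
	/* Assumption: single-threaded, so no snapshot of pending_gen is needed. */
	mm->completed_gen = mm->pending_gen;
}

/* Step 4: a flush is needed if the completed generation is behind the table's. */
static bool tlb_flush_needed(const struct mm_model *mm,
			     const struct pt_page_model *pt)
{
	return mm->completed_gen < pt->table_gen;
}

/*
 * Call before installing or changing PTEs based on a value read from the
 * table. Returns true if a flush had to be performed first.
 */
static bool prepare_to_rely_on_pte(struct mm_model *mm,
				   struct pt_page_model *pt)
{
	assert(pt->ptl_held);
	if (!tlb_flush_needed(mm, pt))
		return false;
	do_tlb_flush(mm);
	return true;
}
```

Note that the check only looks at the two generation counters; nothing in it
depends on what kind of page the PTE maps.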

That’s about it. I might not have covered some issues with fast-GUP. But in
general I think it is a simple scheme. The thing I like about this scheme
the most is that it avoids relying on almost all the OS data-structures
(e.g., PageAnon()), making it much easier to grasp.

I can revive the patch-set if the overall approach is agreeable.

[1] https://lore.kernel.org/lkml/20210131001132.3368247-1-namit@xxxxxxxxxx/