On Friday, 28 July 2023 13:53:01 CEST Fabio M. De Francesco wrote:
> Extend page_tables.rst by adding a section about the role of MMU and TLB
> in translating between virtual addresses and physical page frames.
> Furthermore explain the concept behind Page Faults and how the Linux
> kernel handles TLB misses. Finally briefly explain how and why to disable
> the page fault handler.

Hello everyone,

I'd be grateful to anyone willing to comment on or formally review this
patch.

At the moment I've only had comments from Jonathan Cameron on RFC v2
(https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@xxxxxxxxx/#t).

Does anybody else want to contribute?

Thanks in advance,

Fabio

> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Ira Weiny <ira.weiny@xxxxxxxxx>
> Cc: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> Cc: Jonathan Corbet <corbet@xxxxxxx>
> Cc: Linus Walleij <linus.walleij@xxxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Mike Rapoport <rppt@xxxxxxxxxx>
> Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@xxxxxxxxx>
> ---
>
> This has been an RFC PATCH in its 2nd version for a week or so. I received
> comments and suggestions on it from Jonathan Cameron (thanks!), and so it has
> now been turned into a real patch. I hope that other people want to add their
> comments on this document in order to further improve and extend it.
>
> The thread with the RFC PATCH v2 and the messages between Jonathan and me
> starts at
> https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@xxxxxxxxx/#r
>
>  Documentation/mm/page_tables.rst | 105 +++++++++++++++++++++++++++++++
>  1 file changed, 105 insertions(+)
>
> diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> index 7840c1891751..6ecfd6d2f1f3 100644
> --- a/Documentation/mm/page_tables.rst
> +++ b/Documentation/mm/page_tables.rst
> @@ -152,3 +152,108 @@ Page table handling code that wishes to be
>  architecture-neutral, such as the virtual memory manager, will need to be
>  written so that it traverses all of the currently five levels. This style should also be preferred for
>  architecture-specific code, so as to be robust to future changes.
> +
> +
> +MMU, TLB, and Page Faults
> +=========================
> +
> +The `Memory Management Unit (MMU)` is a hardware component that handles virtual
> +to physical address translations. It may use relatively small caches in hardware
> +called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
> +these translations.
> +
> +When a process wants to access a memory location, the CPU provides a virtual
> +address to the MMU, which then uses the TLBs to check access permissions and
> +dirty bits and, if possible, resolves the physical address and permits the
> +requested type of access to the corresponding physical address.
> +
> +If the TLBs do not yet hold a translation for the address, the MMU may use the
> +Page Walk Caches and complete or restart the page table walks until a physical
> +address can finally be resolved. Permissions and dirty bits are then checked.
> +
> +In the context of a virtual memory system, like the one used by the Linux
> +kernel, each page of memory has associated permission and dirty bits.
> +
> +The dirty bit for a page is set (i.e., turned on) when the page is written
> +to. This indicates that the page has been modified since it was loaded into
> +memory. It probably needs to be written back to disk, or other cores may need
> +to be informed about previous changes before allowing further operations.
> +
> +If nothing prevents it, eventually the physical memory can be accessed and
> +the requested operation on the physical frame is performed.
> +
> +There are several reasons why the MMU can't find certain translations. It
> +could happen because the process is trying to access a range of memory that it
> +is not allowed to access, or because the data is not present in RAM.
> +
> +When these conditions happen, the MMU triggers page faults, which are
> +exceptions that signal the CPU to pause the current process and run a special
> +function to handle them.
> +
> +One cause of page faults is bugs (or maliciously crafted addresses): a fault
> +is raised when a process tries to access a range of memory that it doesn't have
> +permission to access. This could be because the memory is reserved for the
> +kernel or for another process, or because the process is trying to write to a
> +read-only section of memory. When this happens, the kernel sends a Segmentation
> +Fault (SIGSEGV) signal to the process, which usually causes it to terminate.
> +
> +An expected and more common cause of page faults is an optimization called "lazy
> +allocation". This is a technique used by the kernel to improve memory efficiency
> +and reduce footprint. Instead of allocating physical memory to a process as soon
> +as it's requested, the kernel waits until the process actually tries to use the
> +memory. This can save a significant amount of memory in cases where a process
> +requests a large block but only uses a small portion of it.
> +
> +A related technique is called "Copy-on-Write" (CoW), where the kernel allows
> +multiple processes to share the same physical memory as long as they're only
> +reading from it. If a process tries to write to the shared memory, the MMU
> +raises a page fault and the kernel allocates a separate copy of the memory for
> +that process. This lets the kernel save memory and avoid unnecessary data
> +copying, which reduces both latency and memory usage.
> +
> +Now, let's see how the Linux kernel handles these page faults:
> +
> +1. For most architectures, `do_page_fault()` is the primary interrupt handler
> +   for page faults. It delegates the actual handling of the page fault to
> +   `handle_mm_fault()`. This function checks the cause of the page fault and
> +   takes the appropriate action, such as loading the required page into
> +   memory, granting the process the necessary permissions, or sending a
> +   SIGSEGV signal to the process.
> +
> +2. In the specific case of the x86 architecture, the interrupt handler is
> +   defined by the `DEFINE_IDTENTRY_RAW_ERRORCODE()` macro, which calls
> +   `handle_page_fault()`. This function then calls either
> +   `do_user_addr_fault()` or `do_kern_addr_fault()`, depending on whether
> +   the fault occurred in user space or kernel space. Both of these functions
> +   eventually lead to `handle_mm_fault()`, similar to the workflow in other
> +   architectures.
> +
> +`handle_mm_fault()` (likely) ends up calling `__handle_mm_fault()` to carry
> +out the actual work of allocating the page tables. It works by using
> +several functions to find the entry offsets in the 4 or 5 levels of tables
> +and to allocate the tables it needs. The functions that look up the offsets
> +have names like `*_offset()`, where the "*" stands for pgd, p4d, pud, pmd, pte;
> +the functions that allocate the corresponding tables, level by level,
> +are named `*_alloc()`, following the same convention of naming them after
> +the corresponding types of tables in the hierarchy.
> +
> +At the very end of the walk with allocations, if it didn't return errors,
> +`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via
> +`do_fault()` performs one of `do_read_fault()`, `do_cow_fault()`, or
> +`do_shared_fault()`. "read", "cow", "shared" give hints about the reasons
> +and the kind of fault it's handling.
> +
> +The actual implementation of the workflow is very complex. Its design allows
> +Linux to handle page faults in a way that is tailored to the specific
> +characteristics of each architecture, while still sharing a common overall
> +structure.
> +
> +To conclude this brief, very high-level overview of how Linux handles page
> +faults, let's add that the page fault handler can be disabled and enabled
> +with `pagefault_disable()` and `pagefault_enable()`, respectively.
> +
> +Several code paths make use of these two functions because they need to
> +disable traps into the page fault handler, mostly to prevent deadlocks.[1]
> +
> +[1] mm/userfaultfd: Replace kmap/kmap_atomic() with kmap_local_page()
> +https://lore.kernel.org/all/20221025220136.2366143-1-ira.weiny@xxxxxxxxx/
> --
> 2.41.0
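
As a rough illustration of what the `*_offset()` helpers mentioned in the
patch compute at each level, here is a minimal user-space sketch. It is not
kernel code. It assumes x86-64 with 4 KiB pages and 4-level paging (p4d
folded into pgd), so 9 index bits per level; the names `walk_indices` and
`split_vaddr()` are hypothetical and exist only for this example.

```c
#include <stdint.h>

/*
 * Assumed layout: x86-64, 4 KiB pages, 4-level paging (p4d folded).
 * The shift names mirror the kernel's, but they are redefined here for
 * illustration only.
 */
#define PAGE_SHIFT      12
#define PMD_SHIFT       21
#define PUD_SHIFT       30
#define PGDIR_SHIFT     39
#define PTRS_PER_TABLE  512     /* 2^9 entries in each table */

/* Hypothetical helper type: the index used at each level of the walk. */
struct walk_indices {
	unsigned int pgd;
	unsigned int pud;
	unsigned int pmd;
	unsigned int pte;
	unsigned int page_offset;   /* byte offset inside the 4 KiB page */
};

/*
 * Split a virtual address into the per-level table indices. Each index
 * selects one entry in the table of that level, which is conceptually
 * what the kernel's *_offset() functions do with real table pointers.
 */
struct walk_indices split_vaddr(uint64_t vaddr)
{
	struct walk_indices w;

	w.pgd = (vaddr >> PGDIR_SHIFT) & (PTRS_PER_TABLE - 1);
	w.pud = (vaddr >> PUD_SHIFT) & (PTRS_PER_TABLE - 1);
	w.pmd = (vaddr >> PMD_SHIFT) & (PTRS_PER_TABLE - 1);
	w.pte = (vaddr >> PAGE_SHIFT) & (PTRS_PER_TABLE - 1);
	w.page_offset = vaddr & ((1ULL << PAGE_SHIFT) - 1);

	return w;
}
```

With these assumptions, `split_vaddr(0x00007f1234567abcULL)` yields pgd 254,
pud 72, pmd 418, pte 359 and a page offset of 0xabc. With 5-level paging
there would be one more 9-bit index (p4d) between pgd and pud.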