On Fri, Aug 18, 2023 at 01:19:34PM +0200, Fabio M. De Francesco wrote:
> Extend page_tables.rst by adding a section about the role of MMU and TLB
> in translating between virtual addresses and physical page frames.
> Furthermore explain the concept behind Page Faults and how the Linux
> kernel handles TLB misses. Finally briefly explain how and why to disable
> the page faults handler.
>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Ira Weiny <ira.weiny@xxxxxxxxx>
> Cc: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> Cc: Jonathan Corbet <corbet@xxxxxxx>
> Cc: Linus Walleij <linus.walleij@xxxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
> Cc: Mike Rapoport <rppt@xxxxxxxxxx>
> Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
> Reviewed-by: Linus Walleij <linus.walleij@xxxxxxxxxx>
> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@xxxxxxxxx>

Acked-by: Mike Rapoport (IBM) <rppt@xxxxxxxxxx>

> ---
>
> v2 -> v3: This version fixes the grammar mistakes found by Linus
> and forwards his "Reviewed-by" tag (thanks!).
> https://lore.kernel.org/all/CACRpkdbq8UCtvtRH7FZUEqvTxPQcoGbrKvf_mT5QHMAfVoYNNQ@xxxxxxxxxxxxxx/
>
> v1 -> v2: This version takes into account the comments provided by Mike
> (thanks!). I hope I haven't overlooked anything he suggested :-)
> https://lore.kernel.org/all/20230807105010.GK2607694@xxxxxxxxxx/
>
> Furthermore, v2 adds a few more details about swapping which were not
> present in v1.
>
> Before the "real" patch, this was an RFC PATCH in its 2nd version for a week
> or so, until I received comments and suggestions from Jonathan Cameron
> (thanks!), and then it morphed into a real patch.
>
> The link to the thread with the RFC PATCH v2 and the messages between Jonathan
> and me starts at https://lore.kernel.org/all/20230723120721.7139-1-fmdefrancesco@xxxxxxxxx/#r
>
>  Documentation/mm/page_tables.rst | 127 +++++++++++++++++++++++++++++++
>  1 file changed, 127 insertions(+)
>
> diff --git a/Documentation/mm/page_tables.rst b/Documentation/mm/page_tables.rst
> index 7840c1891751..be47b192a596 100644
> --- a/Documentation/mm/page_tables.rst
> +++ b/Documentation/mm/page_tables.rst
> @@ -152,3 +152,130 @@ Page table handling code that wishes to be architecture-neutral, such as the
>  virtual memory manager, will need to be written so that it traverses all of the
>  currently five levels. This style should also be preferred for
>  architecture-specific code, so as to be robust to future changes.
> +
> +
> +MMU, TLB, and Page Faults
> +=========================
> +
> +The `Memory Management Unit (MMU)` is a hardware component that handles virtual
> +to physical address translations. It may use relatively small caches in hardware
> +called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
> +these translations.
> +
> +When a CPU accesses a memory location, it provides a virtual address to the MMU,
> +which checks whether there is an existing translation in the TLB or in the Page
> +Walk Caches (on architectures that support them). If no translation is found,
> +the MMU walks the page tables to determine the physical address and create the
> +mapping.
> +
> +Each page of memory has associated permission and dirty bits. The latter
> +indicate that the page has been modified since it was loaded into memory:
> +the dirty bit for a page is set (i.e., turned on) when the page is written to.
> +
> +If nothing prevents it, eventually the physical memory can be accessed and the
> +requested operation on the physical frame is performed.
> +
> +There are several reasons why the MMU can't find certain translations.
> +It could happen because the CPU is trying to access memory that the current
> +task is not permitted to access, or because the data is not present in
> +physical memory.
> +
> +When these conditions happen, the MMU triggers page faults, which are types of
> +exceptions that signal the CPU to pause the current execution and run a special
> +function to handle them.
> +
> +There are common and expected causes of page faults. These are triggered by
> +process management optimization techniques called "Lazy Allocation" and
> +"Copy-on-Write". Page faults may also happen when frames have been swapped out
> +to persistent storage (swap partition or file) and evicted from their physical
> +locations.
> +
> +These techniques improve memory efficiency, reduce latency, and minimize space
> +occupation. This document won't go deeper into the details of "Lazy Allocation"
> +and "Copy-on-Write" because these subjects are out of scope as they belong to
> +Process Address Management.
> +
> +Swapping differentiates itself from the other mentioned techniques because it's
> +undesirable: it's performed only as a means to free memory under heavy
> +pressure.
> +
> +Swapping can't work for memory mapped by kernel logical addresses. These are a
> +subset of the kernel virtual space that directly maps a contiguous range of
> +physical memory. Given any logical address, its physical address is determined
> +with simple arithmetic on an offset. Accesses to logical addresses are fast
> +because they avoid the need for complex page table lookups, at the expense of
> +the frames not being evictable or pageable out.
> +
> +If the kernel fails to make room for the data that must be present in the
> +physical frames, it invokes the out-of-memory (OOM) killer to make room by
> +terminating lower priority processes until pressure falls below a safe
> +threshold.
> + > +Additionally, page faults may be also caused by code bugs or by maliciously > +crafted addresses that the CPU is instructed to access. A thread of a process > +could use instructions to address (non-shared) memory which does not belong to > +its own address space, or could try to execute an instruction that want to write > +to a read-only location. > + > +If the above-mentioned conditions happen in user-space, the kernel sends a > +`Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually > +causes the termination of the thread and of the process it belongs to. > + > +This document is going to simplify and show an high altitude view of how the > +Linux kernel handles these page faults, creates tables and tables' entries, > +check if memory is present and, if not, requests to load data from persistent > +storage or from other devices, and updates the MMU and its caches. > + > +The first steps are architecture dependent. Most architectures jump to > +`do_page_fault()`, whereas the x86 interrupt handler is defined by the > +`DEFINE_IDTENTRY_RAW_ERRORCODE()` macro which calls `handle_page_fault()`. > + > +Whatever the routes, all architectures end up to the invocation of > +`handle_mm_fault()` which, in turn, (likely) ends up calling > +`__handle_mm_fault()` to carry out the actual work of allocating the page > +tables. > + > +The unfortunate case of not being able to call `__handle_mm_fault()` means > +that the virtual address is pointing to areas of physical memory which are not > +permitted to be accessed (at least from the current context). This > +condition resolves to the kernel sending the above-mentioned SIGSEGV signal > +to the process and leads to the consequences already explained. > + > +`__handle_mm_fault()` carries out its work by calling several functions to > +find the entry's offsets of the upper layers of the page tables and allocate > +the tables that it may need. 
> + > +The functions that look for the offset have names like `*_offset()`, where the > +"*" is for pgd, p4d, pud, pmd, pte; instead the functions to allocate the > +corresponding tables, layer by layer, are called `*_alloc`, using the > +above-mentioned convention to name them after the corresponding types of tables > +in the hierarchy. > + > +The page table walk may end at one of the middle or upper layers (PMD, PUD). > + > +Linux supports larger page sizes than the usual 4KB (i.e., the so called > +`huge pages`). When using these kinds of larger pages, higher level pages can > +directly map them, with no need to use lower level page entries (PTE). Huge > +pages contain large contiguous physical regions that usually span from 2MB to > +1GB. They are respectively mapped by the PMD and PUD page entries. > + > +The huge pages bring with them several benefits like reduced TLB pressure, > +reduced page table overhead, memory allocation efficiency, and performance > +improvement for certain workloads. However, these benefits come with > +trade-offs, like wasted memory and allocation challenges. > + > +At the very end of the walk with allocations, if it didn't return errors, > +`__handle_mm_fault()` finally calls `handle_pte_fault()`, which via `do_fault()` > +performs one of `do_read_fault()`, `do_cow_fault()`, `do_shared_fault()`. > +"read", "cow", "shared" give hints about the reasons and the kind of fault it's > +handling. > + > +The actual implementation of the workflow is very complex. Its design allows > +Linux to handle page faults in a way that is tailored to the specific > +characteristics of each architecture, while still sharing a common overall > +structure. > + > +To conclude this high altitude view of how Linux handles page faults, let's > +add that the page faults handler can be disabled and enabled respectively with > +`pagefault_disable()` and `pagefault_enable()`. 
> + > +Several code path make use of the latter two functions because they need to > +disable traps into the page faults handler, mostly to prevent deadlocks. > -- > 2.41.0 > -- Sincerely yours, Mike.