On 05/29/2018 04:37 AM, Mike Rapoport wrote:
> Hi,
>
> From 2d3ec7ea101a66b1535d5bec4acfc1e0f737fd53 Mon Sep 17 00:00:00 2001
> From: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
> Date: Tue, 29 May 2018 14:12:39 +0300
> Subject: [PATCH] docs/admin-guide/mm: add high level concepts overview
>
> The are terms that seem obvious to the mm developers, but may be somewhat
  There are [or: These are]
> obscure for, say, less involved readers.
>
> The concepts overview can be seen as an "extended glossary" that introduces
> such terms to the readers of the kernel documentation.
>
> Signed-off-by: Mike Rapoport <rppt@xxxxxxxxxxxxxxxxxx>
> ---
>  Documentation/admin-guide/mm/concepts.rst | 222 ++++++++++++++++++++++++++++++
>  Documentation/admin-guide/mm/index.rst    |   5 +
>  2 files changed, 227 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/concepts.rst
>
> diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst
> new file mode 100644
> index 0000000..291699c
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/concepts.rst
> @@ -0,0 +1,222 @@
> +.. _mm_concepts:
> +
> +=================
> +Concepts overview
> +=================
> +
> +The memory management in Linux is complex system that evolved over the
                                     is a complex
> +years and included more and more functionality to support variety of
                                                      support a variety of
> +systems from MMU-less microcontrollers to supercomputers. The memory
> +management for systems without MMU is called ``nommu`` and it
                                 without an MMU
> +definitely deserves a dedicated document, which hopefully will be
> +eventually written. Yet, although some of the concepts are the same,
> +here we assume that MMU is available and CPU can translate a virtual
                     that an MMU              and a CPU
> +address to a physical address.
> +
> +.. contents:: :local:
> +
> +Virtual Memory Primer
> +=====================
> +
> +The physical memory in a computer system is a limited resource and
> +even for systems that support memory hotplug there is a hard limit on
> +the amount of memory that can be installed. The physical memory is not
> +necessary contiguous, it might be accessible as a set of distinct
   Change comma to semi-colon or period (and if latter, s/it/It/).
> +address ranges. Besides, different CPU architectures, and even
> +different implementations of the same architecture have different view
                                                                       views of
> +how these address ranges defined.
> +
> +All this makes dealing directly with physical memory quite complex and
> +to avoid this complexity a concept of virtual memory was developed.
> +
> +The virtual memory abstracts the details of physical memory from the
   virtual memory {system, implementation} abstracts
> +application software, allows to keep only needed information in the
   software, allowing the VM to keep only needed information in the
> +physical memory (demand paging) and provides a mechanism for the
> +protection and controlled sharing of data between processes.
> +
> +With virtual memory, each and every memory access uses a virtual
> +address. When the CPU decodes the an instruction that reads (or
> +writes) from (or to) the system memory, it translates the `virtual`
> +address encoded in that instruction to a `physical` address that the
> +memory controller can understand.
> +
> +The physical system memory is divided into page frames, or pages. The
> +size of each page is architecture specific. Some architectures allow
> +selection of the page size from several supported values; this
> +selection is performed at the kernel build time by setting an
> +appropriate kernel configuration option.
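
Since this is aimed at admin-guide readers, maybe it's also worth showing
how to see the page size on a given system?  Entirely optional, and the
snippet below is only an untested sketch (plain sysconf(3), nothing
exotic), but something like this, or just a pointer to
`getconf PAGE_SIZE`, could make the paragraph more concrete:

#include <stdio.h>
#include <unistd.h>

/* Print the base page size used for memory mappings on this system. */
int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);

	if (page_size < 0) {
		perror("sysconf");
		return 1;
	}
	printf("page size: %ld bytes\n", page_size);
	return 0;
}
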
> +
> +Each physical memory page can be mapped as one or more virtual
> +pages. These mappings are described by page tables that allow
> +translation from virtual address used by programs to real address in
                  from a virtual address to {a, the} real address in
> +the physical memory. The page tables organized hierarchically.
                                   tables are organized
> +
> +The tables at the lowest level of the hierarchy contain physical
> +addresses of actual pages used by the software. The tables at higher
> +levels contain physical addresses of the pages belonging to the lower
> +levels. The pointer to the top level page table resides in a
> +register. When the CPU performs the address translation, it uses this
> +register to access the top level page table. The high bits of the
> +virtual address are used to index an entry in the top level page
> +table. That entry is then used to access the next level in the
> +hierarchy with the next bits of the virtual address as the index to
> +that level page table. The lowest bits in the virtual address define
> +the offset inside the actual page.
> +
> +Huge Pages
> +==========
> +
> +The address translation requires several memory accesses and memory
> +accesses are slow relatively to CPU speed. To avoid spending precious
> +processor cycles on the address translation, CPUs maintain a cache of
> +such translations called Translation Lookaside Buffer (or
> +TLB). Usually TLB is pretty scarce resource and applications with
> +large memory working set will experience performance hit because of
> +TLB misses.
> +
> +Many modern CPU architectures allow mapping of the memory pages
> +directly by the higher levels in the page table. For instance, on x86,
> +it is possible to map 2M and even 1G pages using entries in the second
> +and the third level page tables. In Linux such pages are called
> +`huge`. Usage of huge pages significantly reduces pressure on TLB,
> +improves TLB hit-rate and thus improves overall system performance.
> +
> +There are two mechanisms in Linux that enable mapping of the physical
> +memory with the huge pages. The first one is `HugeTLB filesystem`, or
> +hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
> +store. For the files created in this filesystem the data resides in
> +the memory and mapped using huge pages. The hugetlbfs is described at
> +:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.
> +
> +Another, more recent, mechanism that enables use of the huge pages is
> +called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
> +requires users and/or system administrators to configure what parts of
> +the system memory should and can be mapped by the huge pages, THP
> +manages such mappings transparently to the user and hence the
> +name. See
> +:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
> +for more details about THP.
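
It might also help to show how an application actually ends up with huge
pages.  Optional again, and the snippet is an untested sketch, but
something along these lines illustrates the THP route via madvise(2);
the hugetlbfs route would instead be mmap() with MAP_HUGETLB or a file
on a mounted hugetlbfs:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (8UL * 1024 * 1024)	/* room for a few 2M huge pages */

int main(void)
{
	/* Plain anonymous mapping; only virtual address space so far. */
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Ask the kernel to back this range with transparent huge pages
	 * where possible.  Whether that happens depends on the THP
	 * configuration and on the alignment of the range.
	 */
	if (madvise(p, LEN, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	memset(p, 0, LEN);	/* fault the memory in */

	/* While paused, AnonHugePages: in /proc/<pid>/smaps shows the result. */
	getchar();

	munmap(p, LEN);
	return 0;
}

With THP enabled ("always" or "madvise" in
/sys/kernel/mm/transparent_hugepage/enabled) the AnonHugePages counter
for that mapping should come out as a multiple of 2048 kB.
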
> +
> +Zones
> +=====
> +
> +Often hardware poses restrictions on how different physical memory
> +ranges can be accessed. In some cases, devices cannot perform DMA to
> +all the addressable memory. In other cases, the size of the physical
> +memory exceeds the maximal addressable size of virtual memory and
> +special actions are required to access portions of the memory. Linux
> +groups memory pages into `zones` according to their possible
> +usage. For example, ZONE_DMA will contain memory that can be used by
> +devices for DMA, ZONE_HIGHMEM will contain memory that is not
> +permanently mapped into kernel's address space and ZONE_NORMAL will
> +contain normally addressed pages.
> +
> +The actual layout of the memory zones is hardware dependent as not all
> +architectures define all zones, and requirements for DMA are different
> +for different platforms.
> +
> +Nodes
> +=====
> +
> +Many multi-processor machines are NUMA - Non-Uniform Memory Access -
> +systems. In such systems the memory is arranged into banks that have
> +different access latency depending on the "distance" from the
> +processor. Each bank is referred as `node` and for each node Linux
                         is referred to as a `node`
> +constructs an independent memory management subsystem. A node has it's
                                                                       its
> +own set of zones, lists of free and used pages and various statistics
> +counters. You can find more details about NUMA in
> +:ref:`Documentation/vm/numa.rst <numa>` and in
> +:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
> +
> +Page cache
> +==========
> +
> +The physical memory is volatile and the common case for getting data
> +into the memory is to read it from files. Whenever a file is read, the
> +data is put into the `page cache` to avoid expensive disk access on
> +the subsequent reads. Similarly, when one writes to a file, the data
> +is placed in the page cache and eventually gets into the backing
> +storage device. The written pages are marked as `dirty` and when Linux
> +decides to reuse them for other purposes, it makes sure to synchronize
> +the file contents on the device with the updated data.
> +
> +Anonymous Memory
> +================
> +
> +The `anonymous memory` or `anonymous mappings` represent memory that
> +is not backed by a filesystem. Such mappings are implicitly created
> +for program's stack and heap or by explicit calls to mmap(2) system
> +call. Usually, the anonymous mappings only define virtual memory areas
> +that the program is allowed to access. The read accesses will result
> +in creation of a page table entry that references a special physical
> +page filled with zeroes. When the program performs a write, regular
                                                              write, a regular
> +physical page will be allocated to hold the written data. The page
> +will be marked dirty and if the kernel will decide to repurpose it,
> +the dirty page will be swapped out.
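
A small demo might help here as well, since "reads hit the zero page,
writes allocate real pages" is exactly the kind of thing readers can see
for themselves by watching RSS.  Again only an untested sketch:

#include <stdio.h>
#include <sys/mman.h>

#define LEN (4UL * 1024 * 1024)

int main(void)
{
	/*
	 * Anonymous private mapping: at this point only the virtual
	 * area exists, no physical pages have been allocated.
	 */
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	size_t i;

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Reads only install page table entries that point to the
	 * shared zero page, so RSS barely changes...
	 */
	volatile char c = p[0];
	(void)c;

	/*
	 * ...while writes fault in real physical pages (one per page
	 * touched), which shows up as RSS growth in /proc/self/status.
	 * 4096 is hardcoded for brevity; real code would use
	 * sysconf(_SC_PAGESIZE).
	 */
	for (i = 0; i < LEN; i += 4096)
		p[i] = 1;

	munmap(p, LEN);
	return 0;
}
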
> +
> +Reclaim
> +=======
> +
> +Throughout the system lifetime, a physical page can be used for storing
> +different types of data. It can be kernel internal data structures,
> +DMA'able buffers for device drivers use, data read from a filesystem,
> +memory allocated by user space processes etc.
> +
> +Depending on the page usage it is treated differently by the Linux
> +memory management. The pages that can be freed at any time, either
> +because they cache the data available elsewhere, for instance, on a
> +hard disk, or because they can be swapped out, again, to the hard
> +disk, are called `reclaimable`. The most notable categories of the
> +reclaimable pages are page cache and anonymous memory.
> +
> +In most cases, the pages holding internal kernel data and used as DMA
> +buffers cannot be repurposed, and they remain pinned until freed by
> +their user. Such pages are called `unreclaimable`. However, in certain
> +circumstances, even pages occupied with kernel data structures can be
> +reclaimed. For instance, in-memory caches of filesystem metadata can
> +be re-read from the storage device and therefore it is possible to
> +discard them from the main memory when system is under memory
> +pressure.
> +
> +The process of freeing the reclaimable physical memory pages and
> +repurposing them is called (surprise!) `reclaim`. Linux can reclaim
> +pages either asynchronously or synchronously, depending on the state
> +of the system. When system is not loaded, most of the memory is free
                   When {the, a} system
> +and allocation request will be satisfied immediately from the free
        requests [or: and an allocation request]
> +pages supply. As the load increases, the amount of the free pages goes
> +down and when it reaches a certain threshold (high watermark), an
> +allocation request will awaken the ``kswapd`` daemon. It will
> +asynchronously scan memory pages and either just free them if the data
> +they contain is available elsewhere, or evict to the backing storage
> +device (remember those dirty pages?). As memory usage increases even
> +more and reaches another threshold - min watermark - an allocation
> +will trigger the `direct reclaim`. In this case allocation is stalled
                s/the//
> +until enough memory pages are reclaimed to satisfy the request.
> +
> +Compaction
> +==========
> +
> +As the system runs, tasks allocate and free the memory and it becomes
> +fragmented. Although with virtual memory it is possible to present
> +scattered physical pages as virtually contiguous range, sometimes it is
> +necessary to allocate large physically contiguous memory areas. Such
> +need may arise, for instance, when a device driver requires large
                                                       requires a large
> +buffer for DMA, or when THP allocates a huge page. Memory `compaction`
> +addresses the fragmentation issue. This mechanism moves occupied pages
> +from the lower part of a memory zone to free pages in the upper part
> +of the zone. When a compaction scan is finished free pages are grouped
> +together at the beginning of the zone and allocations of large
> +physically contiguous areas become possible.
> +
> +Like reclaim, the compaction may happen asynchronously in ``kcompactd``
                                                            in the
> +daemon or synchronously as a result of memory allocation request.
                                           of a memory allocation request.
> +
> +OOM killer
> +==========
> +
> +It may happen, that on a loaded machine memory will be exhausted. When
              no comma.
> +the kernel detects that the system runs out of memory (OOM) it invokes
> +`OOM killer`. Its mission is simple: all it has to do is to select a
> +task to sacrifice for the sake of the overall system health. The
> +selected task is killed in a hope that after it exits enough memory
> +will be freed to continue normal operation.

thanks for doing this overview.

-- 
~Randy
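
P.S. If the reclaim/compaction text ever grows a "how can I watch this
happen" angle, a trivial dump of the relevant /proc/vmstat counters is
one way to show kswapd vs. direct reclaim and compaction activity.
Untested sketch, and the exact counter names differ between kernel
versions, so treat it purely as an illustration:

#include <stdio.h>
#include <string.h>

/*
 * Print the /proc/vmstat counters related to reclaim and compaction:
 * the pgscan and pgsteal counters cover kswapd (background) and direct
 * reclaim, the compact counters cover memory compaction.
 */
int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "pgscan", 6) ||
		    !strncmp(line, "pgsteal", 7) ||
		    !strncmp(line, "compact", 7))
			fputs(line, stdout);
	}

	fclose(f);
	return 0;
}

The gap between the kswapd and direct variants of those counters is
exactly the asynchronous vs. direct reclaim distinction the text
describes.
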