Hi Bagas,

Thanks for your comments!

On 2/24/25 01:43, Bagas Sanjaya wrote:
> On Sun, Feb 23, 2025 at 06:53:59PM +0000, Jiwen Qi wrote:
>> Briefly describe what zones are and the fields of struct zone.
>>
>
> Cc'ing Mike.
>
>> Signed-off-by: Jiwen Qi <jiwen7.qi@xxxxxxxxx>
>> ---
>>  Documentation/mm/physical_memory.rst | 259 ++++++++++++++++++++++++++-
>>  1 file changed, 257 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
>> index 71fd4a6acf42..227997694851 100644
>> --- a/Documentation/mm/physical_memory.rst
>> +++ b/Documentation/mm/physical_memory.rst
>> @@ -338,10 +338,265 @@ Statistics
>>
>>  Zones
>>  =====
>> +As we have mentioned, each zone in memory is described by a ``struct zone``
>> +which is an element of the ``node_zones`` field of the node it belongs to. A
>> +zone represents a range of physical memory. A zone may have holes. The
>
> ..., and may have holes.

I will change it to "a range of physical memory and may have holes." as
suggested.

>> +``spanned_pages`` field represents the total pages spanned by the zone,
>> +the ``present_pages`` field represents the physical pages existing within the
>
> ; and the ...

I will remove this part as suggested by Mike.

>> +zone and the managed_page field represents the pages managed by the buddy system.
>> +
>> +Linux uses the GFP flags, see ``include/linux/gfp_types.h``, specified by
>
> or (see :ref:`Documentation/core-api/mm-api.rst <mm-api-gfp-flags>` for reference on these flags)?

I will change it to "Linux uses the GFP flags, see :ref:`mm-api-gfp-flags`,
specified by" as suggested.

>> +a memory allocation to determine the highest zone in a node from which
>> +the memory allocation can allocate memory. Linux first allocates memory from
>
> The kernel first ...

I will change it to "The kernel first allocates memory from" as suggested.

>> +that zone. If Linux can't allocate the requested amount of memory from the
>> +zone, it will allocate memory from the next lower zone in the node; the
>> +process continues down to and including the lowest zone. For example, if a
>> +node contains ``ZONE_DMA32``, ``ZONE_NORMAL`` and ``ZONE_MOVABLE`` and the
>> +highest zone of a memory allocation is ``ZONE_MOVABLE``, the order of the
>> +zones from which Linux allocates memory is
>> +``ZONE_MOVABLE`` > ``ZONE_NORMAL`` > ``ZONE_DMA32``.
>
> ... from which the kernel allocates ...

I will replace "Linux" with "the kernel" as suggested.

>> +
>> +At runtime, free pages in a zone are in the Per-CPU Pagesets (PCP) or free
>> +areas of the zone. The Per-CPU Pagesets are pointed to by the
>> +``per_cpu_pageset`` field. The free areas are pointed to by the ``free_area``
>> +field. The Per-CPU Pagesets are a vital mechanism in the Linux kernel's
>> +memory management system. By handling most frequent allocations and frees
>> +locally on each CPU, the Per-CPU Pagesets improve performance and
>> +scalability, especially on systems with many cores. The page allocator in
>> +the Linux kernel employs a two-step strategy for memory allocation, starting
>> +with the Per-CPU Pagesets before falling back to the buddy allocator. Pages
>> +are transferred between the Per-CPU Pagesets and the global free areas
>> +(managed by the buddy allocator) in batches. This minimizes the overhead of
>> +frequent interactions with the global buddy allocator. Free areas in a zone
>> +are represented by an array of ``free_area``, where each element corresponds
>> +to a specific order which is a power of two.
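As a side note, in case it helps readers follow the fallback order: the walk
over zones can be modelled in a few lines of plain C. This is only a
simplified userspace sketch, not the kernel implementation; the zone set and
the free-page counts are invented for illustration.

  #include <stdio.h>

  /*
   * Simplified userspace model of the zone fallback walk described above --
   * not the kernel implementation. The zone order and free page counts are
   * made-up values for illustration only.
   */
  enum zone_idx { DMA32, NORMAL, MOVABLE, NR_ZONES };

  static const char *zone_names[NR_ZONES] = {
          "ZONE_DMA32", "ZONE_NORMAL", "ZONE_MOVABLE"
  };
  /* Pretend only ZONE_DMA32 has memory left. */
  static unsigned long free_pages[NR_ZONES] = { 512, 0, 0 };

  /* Walk from the highest allowed zone down to the lowest. */
  static int alloc_pages_model(enum zone_idx highest, unsigned long nr_pages)
  {
          for (int z = highest; z >= 0; z--) {
                  if (free_pages[z] >= nr_pages) {
                          free_pages[z] -= nr_pages;
                          printf("allocated %lu pages from %s\n",
                                 nr_pages, zone_names[z]);
                          return 0;
                  }
                  printf("%s exhausted, falling back\n", zone_names[z]);
          }
          return -1; /* every zone exhausted */
  }

  int main(void)
  {
          /*
           * The highest usable zone for this allocation is ZONE_MOVABLE, so
           * the walk tries MOVABLE > NORMAL > DMA32, as in the example above.
           */
          return alloc_pages_model(MOVABLE, 16);
  }

Running it, the model reports ZONE_MOVABLE and ZONE_NORMAL as exhausted
before satisfying the request from ZONE_DMA32, matching the order described
above.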
>> +
>> +Architecture specific code calls free_area_init() to initialize zones.
>> +
>> +Zone structure
>> +--------------
>>
>> -.. admonition:: Stub
>> +The zone structure ``struct zone`` is declared in ``include/linux/mmzone.h``.
>
> ... defined in ...

I will change it as suggested.

>> +Here we briefly describe the fields of this structure:
>>
>> - This section is incomplete. Please list and describe the appropriate fields.
>> +General
>> +~~~~~~~
>> +
>> +``_watermark``
>> +  The watermarks for this zone. The min watermark is the point where boosting is
>> +  ignored and an allocation may trigger direct reclaim and direct compaction.
>> +  It is also used to throttle direct reclaim. The low watermark is the point
>> +  where kswapd is woken up. The high watermark is the point where kswapd stops
>> +  reclaiming (a zone is balanced) when the ``NUMA_BALANCING_MEMORY_TIERING``
>> +  bit of ``sysctl_numa_balancing_mode`` is not set. The promo watermark is used
>> +  for memory tiering and NUMA balancing. It is the point where kswapd stops
>> +  reclaiming when the ``NUMA_BALANCING_MEMORY_TIERING`` bit of
>> +  ``sysctl_numa_balancing_mode`` is set. The watermarks are set by
>> +  ``__setup_per_zone_wmarks()``. The min watermark is calculated according to
>> +  the ``vm.min_free_kbytes`` sysctl. The other three watermarks are set according
>> +  to the distance between two watermarks. The distance is calculated according
>> +  to the ``vm.watermark_scale_factor`` sysctl.
>
> The distance itself is calculated taking ``vm.watermark_scale_factor`` into
> account.

I will change it to "The distance itself is calculated taking
``vm.watermark_scale_factor`` sysctl into account" as suggested.

>> +
>> +``watermark_boost``
>> +  The number of pages which are used to boost watermarks to increase reclaim
>> +  pressure to reduce the likelihood of future fallbacks and wake kswapd now
>> +  as the node may be balanced overall and kswapd will not wake naturally.
>> +
>> +``nr_reserved_highatomic``
>> +  The number of pages which are reserved for high-order atomic allocations.
>> +
>> +``nr_free_highatomic``
>> +  The number of free pages in reserved highatomic pageblocks.
>> +
>> +``lowmem_reserve``
>> +  The array of the amounts of the memory reserved in this zone for memory
>> +  allocations. For example, if the highest zone a memory allocation can
>> +  allocate memory from is ``ZONE_MOVABLE``, the amount of memory reserved in
>> +  this zone for this allocation is ``lowmem_reserve[ZONE_MOVABLE]`` when
>> +  attempting to allocate memory from this zone. The reason is that we don't
>> +  know if the memory that we're going to allocate will be freeable and/or
>> +  eventually released, so to avoid totally wasting several GB of ram we must
>> +  reserve some of the lower zone memory (otherwise we risk running OOM on the
>> +  lower zones despite there being tons of freeable ram on the higher zones).
>> +  This array is recalculated by ``setup_per_zone_lowmem_reserve()`` at runtime
>> +  if the ``vm.lowmem_reserve_ratio`` sysctl changes.
>> +
>> +``node``
>> +  The index of the node this zone belongs to. Available only when
>> +  ``CONFIG_NUMA`` is enabled because there is only one zone in a UMA system.
>> +
>> +``zone_pgdat``
>> +  Pointer to the ``pglist_data`` of the node this zone belongs to.
>> +
>> +``per_cpu_pageset``
>> +  Pointer to the Per-CPU Pagesets (PCP) allocated and initialized by
>> +  ``setup_zone_pageset()``. By handling most frequent allocations and frees
>> +  locally on each CPU, the Per-CPU Pagesets improve performance and scalability
>
> PCP improves ...

I will change it to "PCP improves performance and scalability" as suggested.

>> +  on systems with many cores.
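While we are at it, the interplay between ``_watermark`` and
``lowmem_reserve`` may be easier to picture with a toy model. The sketch
below loosely follows the shape of the core check in
``__zone_watermark_ok()``; it is a userspace approximation, and every number
in it is invented.

  #include <stdbool.h>
  #include <stdio.h>

  /*
   * Simplified model of the core watermark check: a zone is usable for an
   * allocation only if its free pages stay above the chosen watermark plus
   * the lowmem reserve kept against allocations whose highest usable zone
   * is `highest_zoneidx`. All numbers are invented.
   */
  enum zone_idx { DMA32, NORMAL, MOVABLE, NR_ZONES };

  struct zone_model {
          unsigned long free_pages;
          unsigned long wmark_min;                /* the "min" watermark */
          unsigned long lowmem_reserve[NR_ZONES]; /* indexed by highest zone */
  };

  static bool zone_ok(const struct zone_model *z, unsigned long nr_pages,
                      enum zone_idx highest_zoneidx)
  {
          unsigned long needed = z->wmark_min +
                                 z->lowmem_reserve[highest_zoneidx];

          /* Free pages left after the allocation must stay above `needed`. */
          return z->free_pages - nr_pages > needed;
  }

  int main(void)
  {
          struct zone_model dma32 = {
                  .free_pages = 4096,
                  .wmark_min = 1024,
                  .lowmem_reserve = {
                          [DMA32] = 0, [NORMAL] = 1024, [MOVABLE] = 3072,
                  },
          };

          /* An allocation capped at ZONE_NORMAL still fits in DMA32... */
          printf("NORMAL-capped alloc ok: %d\n", zone_ok(&dma32, 64, NORMAL));
          /* ...but a MOVABLE-capped one must leave more of DMA32 untouched. */
          printf("MOVABLE-capped alloc ok: %d\n", zone_ok(&dma32, 64, MOVABLE));
          return 0;
  }

With these invented numbers the NORMAL-capped allocation passes while the
MOVABLE-capped one fails, because the zone keeps a larger reserve against
allocations that could have been placed in a higher zone.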
>> +
>> +``pageset_high_min``
>> +  Copied to the ``high_min`` of the Per-CPU Pagesets for faster access.
>> +
>> +``pageset_high_max``
>> +  Copied to the ``high_max`` of the Per-CPU Pagesets for faster access.
>> +
>> +``pageset_batch``
>> +  Copied to the ``batch`` of the Per-CPU Pagesets for faster access. The
>> +  ``batch``, ``high_min`` and ``high_max`` of the Per-CPU Pagesets are used to
>> +  calculate the number of elements the Per-CPU Pagesets obtain from the buddy
>> +  allocator under a single hold of the lock for efficiency. They are also used
>> +  to decide if the Per-CPU Pagesets return pages to the buddy allocator in the
>> +  page free process.
>> +
>> +``pageblock_flags``
>> +  The pointer to the flags for the pageblocks in the system. See
>> +  ``include/linux/pageblock-flags.h``. The memory is allocated in
>
> (see ``include/linux/pageblock-flags.h`` for flags list).

I'll change it to "system (see ``include/linux/pageblock-flags.h`` for flags
list)." as suggested.

>> +  ``setup_usemap()``. Each pageblock occupies ``NR_PAGEBLOCK_BITS`` bits.
>> +  Defined only when ``CONFIG_FLATMEM`` is enabled. The flags are stored in
>> +  ``mem_section`` when ``CONFIG_SPARSEMEM`` is enabled.
>> +
>> +``spanned_pages``
>> +  The total pages spanned by the zone, including holes, which is calculated as:
>> +  ``spanned_pages`` = ``zone_end_pfn`` - ``zone_start_pfn``. It is initialized
>> +  by ``calculate_node_totalpages()``.
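Since ``spanned_pages``, ``present_pages`` and ``managed_pages`` come up
repeatedly, here is a quick worked example of how the three counts relate.
The PFNs, hole size and reserved count below are all made up; ``managed``
roughly corresponds to present pages minus pages reserved early in boot.

  #include <stdio.h>

  /*
   * Worked example of the page-count relationships: spanned covers the whole
   * PFN range including holes, present excludes the holes, and managed
   * further excludes pages reserved outside the buddy allocator. All values
   * are hypothetical.
   */
  int main(void)
  {
          unsigned long zone_start_pfn = 0x100000; /* hypothetical */
          unsigned long zone_end_pfn   = 0x140000; /* hypothetical */
          unsigned long holes          = 0x2000;   /* pages inside holes */
          unsigned long reserved       = 0x800;    /* pages kept from buddy */

          unsigned long spanned = zone_end_pfn - zone_start_pfn;
          unsigned long present = spanned - holes;
          unsigned long managed = present - reserved;

          printf("spanned=%lu present=%lu managed=%lu\n",
                 spanned, present, managed);
          return 0;
  }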
>> +
>> +``nr_isolate_pageblock``
>> +  Number of isolated pageblocks. It is used to solve an incorrect freepage
>> +  counting problem due to racy retrieval of the migratetype of a pageblock.
>> +  Protected by ``zone->lock``. Defined only when ``CONFIG_MEMORY_ISOLATION``
>> +  is enabled.
>> +
>> +``span_seqlock``
>> +  The seqlock to protect ``zone_start_pfn`` and ``spanned_pages``. It is a
>> +  seqlock because it has to be read outside of ``zone->lock``, and it is done in
>> +  the main allocator path. But, it is written quite infrequently.
>
> However, the seqlock is ...

I'll change it to "However, the seqlock is written quite infrequently." as
suggested.

>> +  Defined only when ``CONFIG_MEMORY_HOTPLUG`` is enabled.
>> +
>> +``initialized``
>> +  The flag indicating if the zone is initialized. Set by
>> +  ``init_currently_empty_zone()`` during boot.
>> +
>> +``percpu_drift_mark``
>> +  When free pages are below this point, additional steps are taken when reading
>> +  the number of free pages to avoid per-cpu counter drift allowing watermarks
>> +  to be breached. It is updated in ``refresh_zone_stat_thresholds()``.
>> +
>> +Compaction control
>> +~~~~~~~~~~~~~~~~~~
>> +
>> +``compact_cached_free_pfn``
>> +  The PFN where the compaction free scanner should start in the next scan.
>> +
>> +``compact_cached_migrate_pfn``
>> +  The PFNs where the compaction migration scanner should start in the next
>> +  scan. This array has two elements, the first one is used in ``MIGRATE_ASYNC``
>> +  mode, the other is used in ``MIGRATE_SYNC`` mode.
>
> This array has two elements: the first one is ..., and the other one is ...

I'll change it as suggested.

>> +
>> +``compact_init_migrate_pfn``
>> +  The initial migration PFN which is initialized to 0 at boot time, and to the
>> +  first pageblock with migratable pages in the zone after a full compaction
>> +  finishes. It is used to check if a scan is a whole zone scan or not.
>> +
>> +``compact_blockskip_flush``
>> +  Set to true when the compaction migration scanner and the free scanner meet,
>> +  which means the ``PB_migrate_skip`` bits should be cleared.
>> +
>> +``contiguous``
>> +  Set to true when the zone is contiguous (there is no hole).
>
> (in other words, no hole).

I'll change it as suggested.

>> +
>> +Statistics
>> +~~~~~~~~~~
>> +
>> +``vm_stat``
>> +  VM statistics for the zone. The items tracked are defined by
>> +  ``enum zone_stat_item``.
>> +
>> +``vm_numa_event``
>> +  VM NUMA event statistics for the zone. The items tracked are defined by
>> +  ``enum numa_stat_item``.
>> +
>> +``per_cpu_zonestats``
>> +  Per-CPU VM statistics for the zone. It records VM statistics and VM NUMA
>> +  event statistics on a per-CPU basis. It reduces updates to the global
>> +  ``vm_stat`` and ``vm_numa_event`` fields of the zone to improve performance.
>>
>>  .. _pages:
>>
>
> Thanks.
>

-- 
Regards,
Jiwen