Hi, On Tue, Feb 08, 2022 at 01:19:02AM -0700, Yu Zhao wrote: > Add a design doc and an admin guide. > > Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx> > Acked-by: Brian Geffon <bgeffon@xxxxxxxxxx> > Acked-by: Jan Alexander Steffens (heftig) <heftig@xxxxxxxxxxxxx> > Acked-by: Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx> > Acked-by: Steven Barrett <steven@xxxxxxxxxxxx> > Acked-by: Suleiman Souhlal <suleiman@xxxxxxxxxx> > Tested-by: Daniel Byrne <djbyrne@xxxxxxx> > Tested-by: Donald Carr <d@xxxxxxxxxxxxxxx> > Tested-by: Holger Hoffstätte <holger@xxxxxxxxxxxxxxxxxxxxxx> > Tested-by: Konstantin Kharlamov <Hi-Angel@xxxxxxxxx> > Tested-by: Shuang Zhai <szhai2@xxxxxxxxxxxxxxxx> > Tested-by: Sofia Trinh <sofia.trinh@edi.works> > --- > Documentation/admin-guide/mm/index.rst | 1 + > Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++ > Documentation/vm/index.rst | 1 + > Documentation/vm/multigen_lru.rst | 152 ++++++++++++++++++ Please consider splitting this patch into Documentation/admin-guide and Documentation/vm parts. For now I only had time to review the admin-guide part. > 4 files changed, 275 insertions(+) > create mode 100644 Documentation/admin-guide/mm/multigen_lru.rst > create mode 100644 Documentation/vm/multigen_lru.rst > > diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst > index c21b5823f126..2cf5bae62036 100644 > --- a/Documentation/admin-guide/mm/index.rst > +++ b/Documentation/admin-guide/mm/index.rst > @@ -32,6 +32,7 @@ the Linux memory management. > idle_page_tracking > ksm > memory-hotplug > + multigen_lru > nommu-mmap > numa_memory_policy > numaperf > diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst > new file mode 100644 > index 000000000000..16a543c8b886 > --- /dev/null > +++ b/Documentation/admin-guide/mm/multigen_lru.rst > @@ -0,0 +1,121 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +===================== > +Multigenerational LRU > +===================== + > +Quick start > +=========== There is no explanation why one would want to use multigenerational LRU until the next section. I think there should be an overview that explains why users would want to enable multigenerational LRU. > +Build configurations > +-------------------- > +:Required: Set ``CONFIG_LRU_GEN=y``. Maybe Set ``CONFIG_LRU_GEN=y`` to build kernel with multigenerational LRU > + > +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the > + multigenerational LRU by default. > + > +Runtime configurations > +---------------------- > +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if > + ``CONFIG_LRU_GEN_ENABLED=n``. > + > +This file accepts different values to enabled or disabled the > +following features: Maybe After multigenerational LRU is enabled, this file accepts different values to enable or disable the following feaures: > +====== ======== > +Values Features > +====== ======== > +0x0001 the multigenerational LRU The multigenerational LRU what? What will happen if I write 0x2 to this file? Please consider splitting "enable" and "features" attributes. > +0x0002 clear the accessed bit in leaf page table entries **in large > + batches**, when MMU sets it (e.g., on x86) Is extra markup really needed here... > +0x0004 clear the accessed bit in non-leaf page table entries **as > + well**, when MMU sets it (e.g., on x86) ... and here? As for the descriptions, what is the user-visible effect of these features? How different modes of clearing the access bit are reflected in, say, GUI responsiveness, database TPS, or probability of OOM? > +[yYnN] apply to all the features above > +====== ======== > + > +E.g., > +:: > + > + echo y >/sys/kernel/mm/lru_gen/enabled > + cat /sys/kernel/mm/lru_gen/enabled > + 0x0007 > + echo 5 >/sys/kernel/mm/lru_gen/enabled > + cat /sys/kernel/mm/lru_gen/enabled > + 0x0005 > + > +Most users should enable or disable all the features unless some of > +them have unforeseen side effects. > + > +Recipes > +======= > +Personal computers > +------------------ > +Personal computers are more sensitive to thrashing because it can > +cause janks (lags when rendering UI) and negatively impact user > +experience. The multigenerational LRU offers thrashing prevention to > +the majority of laptop and desktop users who don't have oomd. I'd expect something like this paragraph in overview. > + > +:Thrashing prevention: Write ``N`` to > + ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of > + ``N`` milliseconds from getting evicted. The OOM killer is triggered > + if this working set can't be kept in memory. Based on the average > + human detectable lag (~100ms), ``N=1000`` usually eliminates > + intolerable janks due to thrashing. Larger values like ``N=3000`` > + make janks less noticeable at the risk of premature OOM kills. > + > +Data centers > +------------ > +Data centers want to optimize job scheduling (bin packing) to improve > +memory utilizations. Job schedulers need to estimate whether a server > +can allocate a certain amount of memory for a new job, and this step > +is known as working set estimation, which doesn't impact the existing > +jobs running on this server. They also want to attempt freeing some > +cold memory from the existing jobs, and this step is known as proactive > +reclaim, which improves the chance of landing a new job successfully. This paragraph also fits overview. > + > +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations > + for working set estimation and proactive reclaim. Please add a note that this is build time option. > + > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following Is debugfs interface relevant only for datacenters? > + format: > + :: > + > + memcg memcg_id memcg_path > + node node_id > + min_gen birth_time anon_size file_size > + ... > + max_gen birth_time anon_size file_size > + > + ``min_gen`` is the oldest generation number and ``max_gen`` is the > + youngest generation number. ``birth_time`` is in milliseconds. It's unclear what is birth_time reference point. Is it milliseconds from the system start or it is measured some other way? > + ``anon_size`` and ``file_size`` are in pages. The youngest generation > + represents the group of the MRU pages and the oldest generation > + represents the group of the LRU pages. For working set estimation, a Please spell out MRU and LRU fully. > + job scheduler writes to this file at a certain time interval to > + create new generations, and it ranks available servers based on the > + sizes of their cold memory defined by this time interval. For > + proactive reclaim, a job scheduler writes to this file before it > + tries to land a new job, and if it fails to materialize the cold > + memory without impacting the existing jobs, it retries on the next > + server according to the ranking result. Is this knob only relevant for a job scheduler? Or it can be used in other use-cases as well? > + > + This file accepts commands in the following subsections. Multiple ^ described > + command lines are supported, so does concatenation with delimiters > + ``,`` and ``;``. > + > + ``/sys/kernel/debug/lru_gen_full`` contains additional stats for > + debugging. > + > +:Working set estimation: Write ``+ memcg_id node_id max_gen > + [can_swap [full_scan]]`` to ``/sys/kernel/debug/lru_gen`` to invoke > + the aging. It scans PTEs for hot pages and promotes them to the > + youngest generation ``max_gen``. Then it creates a new generation > + ``max_gen+1``. Set ``can_swap`` to ``1`` to scan for hot anon pages > + when swap is off. Set ``full_scan`` to ``0`` to reduce the overhead > + as well as the coverage when scanning PTEs. > + > +:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness > + [nr_to_reclaim]]`` to ``/sys/kernel/debug/lru_gen`` to invoke the > + eviction. It evicts generations less than or equal to ``min_gen``. > + ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and > + ``max_gen-1`` aren't fully aged and therefore can't be evicted. Use > + ``nr_to_reclaim`` to limit the number of pages to evict. I feel that /sys/kernel/debug/lru_gen is too overloaded. > diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst > index 44365c4574a3..b48434300226 100644 > --- a/Documentation/vm/index.rst > +++ b/Documentation/vm/index.rst > @@ -25,6 +25,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the > ksm > memory-model > mmu_notifier > + multigen_lru > numa > overcommit-accounting > page_migration -- Sincerely yours, Mike.