On Mon, Feb 14, 2022 at 12:28:56PM +0200, Mike Rapoport wrote:

Thanks for reviewing.

> >  Documentation/admin-guide/mm/index.rst        |   1 +
> >  Documentation/admin-guide/mm/multigen_lru.rst | 121 ++++++++++++++
> >  Documentation/vm/index.rst                    |   1 +
> >  Documentation/vm/multigen_lru.rst             | 152 ++++++++++++++++++
>
> Please consider splitting this patch into Documentation/admin-guide and
> Documentation/vm parts.

Will do.

> > +=====================
> > +Multigenerational LRU
> > +=====================
> > +
> > +Quick start
> > +===========
>
> There is no explanation why one would want to use multigenerational LRU
> until the next section.
>
> I think there should be an overview that explains why users would want to
> enable multigenerational LRU.

Will do.

> > +Build configurations
> > +--------------------
> > +:Required: Set ``CONFIG_LRU_GEN=y``.
>
> Maybe
>
>   Set ``CONFIG_LRU_GEN=y`` to build kernel with multigenerational LRU

Will do.

> > +:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to enable the
> > +    multigenerational LRU by default.
> > +
> > +Runtime configurations
> > +----------------------
> > +:Required: Write ``y`` to ``/sys/kernel/mm/lru_gen/enable`` if
> > +    ``CONFIG_LRU_GEN_ENABLED=n``.
> > +
> > +This file accepts different values to enabled or disabled the
> > +following features:
>
> Maybe
>
>   After multigenerational LRU is enabled, this file accepts different
>   values to enable or disable the following feaures:

Will do.

> > +====== ========
> > +Values Features
> > +====== ========
> > +0x0001 the multigenerational LRU
>
> The multigenerational LRU what?

Itself? This depends on the POV, and I'm trying to determine what would
be the natural way to present it. MGLRU itself could be seen as an
add-on atop the existing page reclaim or an alternative in parallel.
The latter would be similar to sl[aou]b, and that's how I personally
see it. But here I presented it more like the former because I feel
this way is more natural to users because they are like switches on a
single panel.

> What will happen if I write 0x2 to this file?

Just like turning on a branch breaker while leaving the main breaker
off in a circuit breaker box. This is how I see it, and I'm totally
fine with changing it to whatever you'd recommend.

> Please consider splitting "enable" and "features" attributes.

How about s/Features/Components/?

> > +0x0002 clear the accessed bit in leaf page table entries **in large
> > +       batches**, when MMU sets it (e.g., on x86)
>
> Is extra markup really needed here...
>
> > +0x0004 clear the accessed bit in non-leaf page table entries **as
> > +       well**, when MMU sets it (e.g., on x86)
>
> ... and here?

Will do.

> As for the descriptions, what is the user-visible effect of these features?
> How different modes of clearing the access bit are reflected in, say, GUI
> responsiveness, database TPS, or probability of OOM?

These remain to be seen :) I just added these switches in v7, per Mel's
request from the meeting we had. These were never tested in the field.

> > +[yYnN] apply to all the features above
> > +====== ========
> > +
> > +E.g.,
> > +::
> > +
> > +    echo y >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0007
> > +    echo 5 >/sys/kernel/mm/lru_gen/enabled
> > +    cat /sys/kernel/mm/lru_gen/enabled
> > +    0x0005
> > +
> > +Most users should enable or disable all the features unless some of
> > +them have unforeseen side effects.
> > +
> > +Recipes
> > +=======
> > +Personal computers
> > +------------------
> > +Personal computers are more sensitive to thrashing because it can
> > +cause janks (lags when rendering UI) and negatively impact user
> > +experience. The multigenerational LRU offers thrashing prevention to
> > +the majority of laptop and desktop users who don't have oomd.
>
> I'd expect something like this paragraph in overview.
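
As a reading aid for the values table quoted above, the value read back
from the ``enabled`` file can be decoded bit by bit. This is only a
sketch: the function name and the short labels (core, leaf-batch,
non-leaf) are made up for illustration, and only the three bit values
come from the patch.

```shell
# Hypothetical helper, not part of the patch: decode a value such as the
# 0x0005 read back from /sys/kernel/mm/lru_gen/enabled in the quoted example.
decode_lru_gen_enabled() {
    mask=$(( $1 ))          # accepts decimal or 0x-prefixed hex
    out=""
    # 0x0001: the multigenerational LRU itself (the "main breaker")
    if [ $(( mask & 0x0001 )) -ne 0 ]; then out="$out core"; fi
    # 0x0002: batched clearing of the accessed bit in leaf entries
    if [ $(( mask & 0x0002 )) -ne 0 ]; then out="$out leaf-batch"; fi
    # 0x0004: clearing of the accessed bit in non-leaf entries as well
    if [ $(( mask & 0x0004 )) -ne 0 ]; then out="$out non-leaf"; fi
    # Drop the leading separator before printing the label list.
    echo "${out# }"
}
```

With this, ``decode_lru_gen_enabled 0x0005`` prints ``core non-leaf``,
and ``decode_lru_gen_enabled 0x2`` prints only ``leaf-batch``, i.e. a
branch breaker switched on while the main breaker is off.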
>
> > +
> > +:Thrashing prevention: Write ``N`` to
> > +    ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to prevent the working set of
> > +    ``N`` milliseconds from getting evicted. The OOM killer is triggered
> > +    if this working set can't be kept in memory. Based on the average
> > +    human detectable lag (~100ms), ``N=1000`` usually eliminates
> > +    intolerable janks due to thrashing. Larger values like ``N=3000``
> > +    make janks less noticeable at the risk of premature OOM kills.
>
> > +
> > +Data centers
> > +------------
> > +Data centers want to optimize job scheduling (bin packing) to improve
> > +memory utilizations. Job schedulers need to estimate whether a server
> > +can allocate a certain amount of memory for a new job, and this step
> > +is known as working set estimation, which doesn't impact the existing
> > +jobs running on this server. They also want to attempt freeing some
> > +cold memory from the existing jobs, and this step is known as proactive
> > +reclaim, which improves the chance of landing a new job successfully.
>
> This paragraph also fits overview.

Will do.

> > +:Optional: Increase ``CONFIG_NR_LRU_GENS`` to support more generations
> > +    for working set estimation and proactive reclaim.
>
> Please add a note that this is build time option.

Will do.

> > +:Debugfs interface: ``/sys/kernel/debug/lru_gen`` has the following
>
> Is debugfs interface relevant only for datacenters?

For the moment, yes.

> > +    format:
> > +    ::
> > +
> > +      memcg  memcg_id  memcg_path
> > +        node  node_id
> > +          min_gen  birth_time  anon_size  file_size
> > +          ...
> > +          max_gen  birth_time  anon_size  file_size
> > +
> > +    ``min_gen`` is the oldest generation number and ``max_gen`` is the
> > +    youngest generation number. ``birth_time`` is in milliseconds.
>
> It's unclear what is birth_time reference point. Is it milliseconds from
> the system start or it is measured some other way?

Good point. Will clarify.

> > +    ``anon_size`` and ``file_size`` are in pages. The youngest generation
> > +    represents the group of the MRU pages and the oldest generation
> > +    represents the group of the LRU pages. For working set estimation, a
>
> Please spell out MRU and LRU fully.

Will do.

> > +    job scheduler writes to this file at a certain time interval to
> > +    create new generations, and it ranks available servers based on the
> > +    sizes of their cold memory defined by this time interval. For
> > +    proactive reclaim, a job scheduler writes to this file before it
> > +    tries to land a new job, and if it fails to materialize the cold
> > +    memory without impacting the existing jobs, it retries on the next
> > +    server according to the ranking result.
>
> Is this knob only relevant for a job scheduler? Or it can be used in other
> use-cases as well?

There are other concrete use cases but I'm not ready to discuss them yet.

> > +    This file accepts commands in the following subsections. Multiple
>
>                                   ^ described

Will do.
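
To illustrate how the debugfs format quoted above could be consumed, for
example by a job scheduler ranking servers by cold memory, here is a
minimal sketch that reports the oldest generation per memcg and node. It
assumes, beyond what the patch text states, that header lines start with
the literal words "memcg" and "node" and that every generation line has
exactly four whitespace-separated fields. The function name is
hypothetical, and ``birth_time`` is deliberately not interpreted because
its reference point is still to be clarified.

```shell
# Hypothetical helper, not part of the patch: read a snapshot in the quoted
# format on stdin and print the oldest generation (min_gen) of each node.
lru_gen_oldest() {
    awk '
        # "memcg memcg_id memcg_path": remember the current memcg.
        $1 == "memcg" { memcg = $2; path = $3; next }
        # "node node_id": remember the node and expect min_gen next.
        $1 == "node"  { node = $2; first = 1; next }
        # "gen birth_time anon_size file_size": sizes are in pages.
        NF == 4 && first {
            # First generation line after a node header is min_gen (oldest).
            printf "%s %s node=%s min_gen=%s anon=%s file=%s\n", \
                memcg, path, node, $1, $3, $4
            first = 0
        }
    '
}
```

On a kernel built with the patch it could be run as
``lru_gen_oldest </sys/kernel/debug/lru_gen``, which requires debugfs to
be mounted and sufficient privileges to read it.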