On Mon, Apr 11, 2022 at 8:16 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 6 Apr 2022 21:15:25 -0600 Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> > +Kill switch
> > +-----------
> > +``enable`` accepts different values to enable or disable the following
>
> It's actually called "enabled".

Good catch. Thanks!

> And I suggest that the file name be
> included right there in the title.  ie.
>
>   "enabled": Kill Switch
>   ======================

Will do.

> > +Experimental features
> > +=====================
> > +``/sys/kernel/debug/lru_gen`` accepts commands described in the
> > +following subsections. Multiple command lines are supported, so does
> > +concatenation with delimiters ``,`` and ``;``.
> > +
> > +``/sys/kernel/debug/lru_gen_full`` provides additional stats for
> > +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
> > +evicted generations in this file.
> > +
> > +Working set estimation
> > +----------------------
> > +Working set estimation measures how much memory an application
> > +requires in a given time interval, and it is usually done with little
> > +impact on the performance of the application. E.g., data centers want
> > +to optimize job scheduling (bin packing) to improve memory
> > +utilizations. When a new job comes in, the job scheduler needs to find
> > +out whether each server it manages can allocate a certain amount of
> > +memory for this new job before it can pick a candidate. To do so, this
> > +job scheduler needs to estimate the working sets of the existing jobs.
>
> These various sysfs interfaces are a big deal.  Because they are so
> hard to change once released.

Debugfs, not sysfs. The title is "Experimental features" :)

> btw, what is this "job scheduler" of which you speak?

Basically it's part of cluster management software. Many jobs
(programs + data) can run concurrently in the same cluster, and the
job scheduler of this cluster does the bin packing.
To improve resource utilization, the job scheduler needs to know the
(memory) size of each job it packs, hence the working set estimation
(how much memory a job uses within a given time interval). The job
scheduler also takes memory from some jobs so that those jobs can
better fit into a single machine (proactive reclaim).

> Is there an open
> source implementation upon which we hope the world will converge?

There are many [1], e.g., Kubernetes (k8s). Personally, I don't think
they'll ever converge. At the moment, all the open source
implementations I know of rely on users manually specifying the size
of each job (job spec), e.g., [2]. Users overprovision memory to avoid
OOM kills, so the average memory utilization is generally surprisingly
low. What we can hope for is that eventually some of the open source
implementations will use the working set estimation and proactive
reclaim features provided here.

[1] https://en.wikipedia.org/wiki/List_of_cluster_management_software
[2] https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/

> > +Proactive reclaim
> > +-----------------
> > +Proactive reclaim induces memory reclaim when there is no memory
> > +pressure and usually targets cold memory only. E.g., when a new job
> > +comes in, the job scheduler wants to proactively reclaim memory on the
> > +server it has selected to improve the chance of successfully landing
> > +this new job.
> > +
> > +Users can write ``- memcg_id node_id min_gen_nr [swappiness
> > +[nr_to_reclaim]]`` to ``lru_gen`` to evict generations less than or
> > +equal to ``min_gen_nr``. Note that ``min_gen_nr`` should be less than
> > +``max_gen_nr-1`` as ``max_gen_nr`` and ``max_gen_nr-1`` are not fully
> > +aged and therefore cannot be evicted. ``swappiness`` overrides the
> > +default value in ``/proc/sys/vm/swappiness``. ``nr_to_reclaim`` limits
> > +the number of pages to evict.
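For illustration, a minimal sketch of how a user (or a job scheduler)
might drive this eviction command from a script. The memcg ID, node
ID, generation number, swappiness, and page count below are made-up
values; the real IDs depend on the system:

```shell
#!/bin/sh
# Hypothetical example, made-up IDs: evict generations <= 3 in memcg 1
# on node 0, overriding swappiness to 0 and capping reclaim at 1000
# pages, per the "- memcg_id node_id min_gen_nr [swappiness
# [nr_to_reclaim]]" format.
cmd='- 1 0 3 0 1000'
lru_gen=/sys/kernel/debug/lru_gen

if [ -w "$lru_gen" ]; then
	echo "$cmd" > "$lru_gen"
else
	# CONFIG_LRU_GEN is off, debugfs is not mounted, or we lack privilege.
	echo "lru_gen interface not available" >&2
fi
```

A scheduler would presumably issue one such write per candidate server
and fall back to the next server if the write fails or too little
memory materializes.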
> > +
> > +A typical use case is that a job scheduler writes to ``lru_gen``
> > +before it tries to land a new job on a server, and if it fails to
> > +materialize the cold memory without impacting the existing jobs on
> > +this server, it retries on the next server according to the ranking
> > +result obtained from the working set estimation step described
> > +earlier.
>
> It sounds to me that these interfaces were developed in response to
> ongoing development and use of a particular job scheduler.

I did borrow some of my previous experience with Google's data
centers. But I'm a Chrome OS developer now, so I designed them to be
job scheduler agnostic :)

> This is a very good thing, but has thought been given to the potential
> needs of other job schedulers?

Yes, basically I'm trying to help everybody replicate the success
stories at Google and Meta [3][4].

[3] https://dl.acm.org/doi/10.1145/3297858.3304053
[4] https://dl.acm.org/doi/10.1145/3503222.3507731