Thanks for the response, Johannes. Some replies inline.

On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace.
> > Another interesting idea might be hugepage workingset, so that we can
> > measure the proportion of hugepages backing cold memory. However, with
> > architectures like arm, there may be too many hugepage sizes leading to
> > a combinatorial explosion when exporting stats to the userspace.
> > Nonetheless, the kernel should provide a set of workingset interfaces
> > that is generic enough to accommodate the various use cases, and extensible
> > to potential future use cases.
>
> Doesn't DAMON already provide this information?
>
> CCing SJ.

Thanks for the CC. DAMON was really good at visualizing the memory
access frequencies the last time I tried it out! For server use cases,
DAMON would benefit from integration with cgroups. The key then would be
a standard interface for exporting a cgroup's working set to userspace.
It would be good to have something that works across the different
backing implementations: DAMON, MGLRU, or the active/inactive LRU.

>
> > Use cases
> > ==========
> > Job scheduling
> > On overcommitted hosts, workingset information improves efficiency and
> > reliability by allowing the job scheduler to have better stats on the
> > exact memory requirements of each job. This can manifest in efficiency by
> > landing more jobs on the same host or NUMA node. On the other hand, the
> > job scheduler can also ensure each node has a sufficient amount of memory
> > and does not enter direct reclaim or the kernel OOM path. With workingset
> > information and job priority, the userspace OOM killing or proactive
> > reclaim policy can kick in before the system is under memory pressure.
> > If the job shape is very different from the machine shape, knowing the
> > workingset per-node can also help inform page allocation policies.
> >
> > Proactive reclaim
> > Workingset information allows a container manager to proactively
> > reclaim memory while not impacting a job's performance. While PSI may
> > provide a reactive measure of when a proactive reclaim has reclaimed too
> > much, workingset reporting allows the policy to be more accurate and
> > flexible.
>
> I'm not sure about more accurate.
>
> Access frequency is only half the picture. Whether you need to keep
> memory with a given frequency resident depends on the speed of the
> backing device.
>
> There is memory compression; there is swap on flash; swap on crappy
> flash; swapfiles that share IOPS with co-located filesystems. There is
> zswap+writeback, where avg refault speed can vary dramatically.
>
> You can of course offload much more to a fast zswap backend than to a
> swapfile on a struggling flashdrive, with comparable app performance.
>
> So I think you'd be hard pressed to achieve a high level of accuracy
> in the usecases you list without taking the (often highly dynamic)
> cost of paging / memory transfer into account.
>
> There is a more detailed discussion of this in a paper we wrote on
> proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:
>
> https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf
>

Yes, PSI takes the paging cost into account. I'm not claiming that
workingset reporting provides a superset of that information, but rather
that it can complement PSI. Sorry for the bad wording here.

> > Ballooning (similar to proactive reclaim)
> > The last patch of the series extends the virtio-balloon device to report
> > the guest workingset.
> > Balloon policies benefit from workingset to more precisely determine the
> > size of the memory balloon. On end-user devices where memory is scarce and
> > overcommitted, the balloon sizing in multiple VMs running on the same
> > device can be orchestrated with workingset reports from each one.
> > On the server side, workingset reporting allows the balloon controller to
> > inflate the balloon without causing too much file cache to be reclaimed in
> > the guest.

The ballooning use case is an important one. Having working set
information would allow us to inflate the balloon to the right size in
the guest.

> >
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, given a promotion hot page threshold defined as a reaccess
> > distance of N seconds (promote pages accessed more often than every N
> > seconds), the threshold N should be set so that ~80% (e.g.) of pages on
> > the fast memory node pass the threshold. This calculation can be done
> > with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >...
> >
> > Benchmarks
> > ==========
> > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > compile and redis benchmarks from openbenchmarking.org. The policy and
> > runner is referred to as WMO (Workload Memory Optimization).
> > The results were based on v3 of the series, but v4 doesn't change the core
> > of the working set reporting and just adds the ballooning counterpart.
> >
> > The timed Linux kernel compilation benchmark shows improvements in peak
> > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > every 40 seconds". A swapfile is configured on SSD.
> > --------------------------------------------
> > peak memory usage (with WMO): 4982.61328 MiB
> > peak memory usage (control): 9569.1367 MiB
> > peak memory reduction: 47.9%
> > --------------------------------------------
> > Benchmark | Experimental | Control | Experimental_Std_Dev | Control_Std_Dev
> > Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6% | 0.1%
> > --------------------------------------------
> > Seconds, fewer is better
>
> You can do this with a recent (>2018) upstream kernel and ~100 lines
> of python [1]. It also works on both LRU implementations.
>
> [1] https://github.com/facebookincubator/senpai
>
> We use this approach in virtually the entire Meta fleet, to offload
> unneeded memory, estimate available capacity for job scheduling, plan
> future capacity needs, and provide accurate memory usage feedback to
> application developers.
>
> It works over a wide variety of CPU and storage configurations with no
> specific tuning.
>
> The paper I referenced above provides a detailed breakdown of how it
> all works together.
>
> I would be curious to see a more in-depth comparison to the prior art
> in this space. At first glance, your proposal seems more complex and
> less robust/versatile, at least for offloading and capacity gauging.

We have implemented TMO PSI-based proactive reclaim and compared it to a
kstaled-based reclaimer (reclaiming based on a 2-minute working set and
refaults). The PSI-based reclaimer was able to save more memory, but it
also caused refault spikes and a much higher rate of decompressions per
second. Overall, the test workloads performed better with the
kstaled-based reclaimer. The conclusion was that it is a trade-off.
Since we have some app classes on which we don't want to induce pressure
but from which we still want to proactively reclaim, there is a missing
piece. I do agree that there isn't a good in-depth comparison with prior
art, though.
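
To make the trade-off concrete, here is a minimal sketch of the
workingset-driven side, in the spirit of the "swap out all bytes colder
than 10 seconds every 40 seconds" policy from the benchmarks above. The
per-cgroup page-age file name and its "age bytes" line format below are
assumptions made up for illustration, not the ABI posted in this series;
memory.reclaim is the existing cgroup v2 interface. Unlike a PSI-driven
loop, the reclaim target comes from how much memory is actually cold
rather than from probing under pressure.

  #!/usr/bin/env python3
  # Sketch of a workingset-driven proactive reclaimer (assumed interfaces).
  import time

  CGROUP = "/sys/fs/cgroup/workload"                  # example cgroup path
  PAGE_AGE = CGROUP + "/memory.workingset.page_age"   # hypothetical file
  COLD_AGE = 10    # seconds; pages idle at least this long count as cold
  INTERVAL = 40    # seconds between reclaim passes

  def cold_bytes(threshold):
      """Sum the bytes in all histogram buckets at least `threshold` old."""
      total = 0
      with open(PAGE_AGE) as f:
          for line in f:
              age, nbytes = line.split()     # assumed "age bytes" format
              if int(age) >= threshold:
                  total += int(nbytes)
      return total

  while True:
      target = cold_bytes(COLD_AGE)
      if target:
          try:
              # Ask the kernel to reclaim that many bytes from the cgroup.
              with open(CGROUP + "/memory.reclaim", "w") as f:
                  f.write(str(target))
          except OSError:
              pass   # partial reclaim is fine; try again next interval
      time.sleep(INTERVAL)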
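
Similarly, the promotion example from the cover letter (pick the reaccess
threshold N so that ~80% of pages on the fast node pass it) is just a
percentile over the same kind of histogram. A minimal sketch, assuming
per-node (age_seconds, bytes) buckets are available and using byte counts
in place of page counts; how the buckets are gathered is out of scope:

  def promotion_threshold(buckets, coverage=0.8):
      """Smallest bucket age that covers at least `coverage` of all bytes."""
      total = sum(nbytes for _, nbytes in buckets)
      running = 0
      for age, nbytes in sorted(buckets):    # youngest (hottest) buckets first
          running += nbytes
          if running >= coverage * total:
              return age
      return buckets[-1][0] if buckets else 0

  # Example: the smallest threshold covering >= 80% of bytes here is 20s,
  # so pages re-accessed more often than every 20 seconds get promoted.
  fast_node = [(1, 4 << 30), (5, 2 << 30), (20, 1 << 30), (60, 1 << 30)]
  print(promotion_threshold(fast_node))      # -> 20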