Re: [PATCH v4 0/9] mm: workingset reporting

SeongJae Park <sj@xxxxxxxxxx> · Wed, 11 Dec 2024 11:53:29 -0800

On Fri, 6 Dec 2024 11:57:55 -0800 Yuanchu Xie <yuanchu@xxxxxxxxxx> wrote:

> Thanks for the response Johannes. Some replies inline.
> 
> On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > > This patch series provides workingset reporting of user pages in
> > > lruvecs, of which coldness can be tracked by accessed bits and fd
> > > references. However, the concept of workingset applies generically to
> > > all types of memory, which could be kernel slab caches, discardable
> > > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > > come from slab shrinkers, device drivers, or the userspace.
> > > Another interesting idea might be hugepage workingset, so that we can
> > > measure the proportion of hugepages backing cold memory. However, with
> > > architectures like arm, there may be too many hugepage sizes leading to
> > > a combinatorial explosion when exporting stats to the userspace.
> > > Nonetheless, the kernel should provide a set of workingset interfaces
> > > that is generic enough to accommodate the various use cases, and extensible
> > > to potential future use cases.
> >
> > Doesn't DAMON already provide this information?
> >
> > CCing SJ.
> Thanks for the CC. DAMON was really good at visualizing the memory
> access frequencies last time I tried it out!

Thank you for this kind acknowledgement, Yuanchu!

> For server use cases,
> DAMON would benefit from integrations with cgroups.  The key then would be a
> standard interface for exporting a cgroup's working set to the user.

I see two ways to make DAMON support cgroups for now.  The first way is making
another DAMON operations set implementation for cgroups.  I shared a rough idea
for this before, probably at Kernel Summit, but I haven't had a chance to
prioritize it so far.  Please let me know if you need more details.  The
second way is extending DAMOS filters to provide more detailed statistics per
DAMON region, and adding another DAMOS action that does nothing but account
the detailed statistics.  Using the new DAMOS action, users will be able to
know how much of specific DAMON-found regions is filtered out by the given
filter.  Because we have a DAMOS filter type for cgroups, we can know how much
of the workingset (or, warm memory) belongs to specific cgroups.  This can be
applied not only to cgroups, but to any existing DAMOS filter type (e.g.,
anonymous page, young page).

I believe the second way is simpler to implement while providing information
that is sufficient for most possible use cases.  I was planning to do this
anyway.
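
As a rough illustration of the second way, the accounting could be sketched in
Python as below.  This is only a sketch under stated assumptions: `Region`,
`page_cgroups`, `warm_bytes_per_cgroup` and `min_accesses` are hypothetical
names made up for this example, not DAMON or DAMOS interfaces, and the 4 KiB
page size is assumed.

```python
from dataclasses import dataclass
from typing import Dict, List

PAGE_SIZE = 4096  # assumed page size for this sketch


@dataclass
class Region:
    """Hypothetical stand-in for a monitored region; not a kernel structure."""
    nr_accesses: int         # access count observed for the whole region
    page_cgroups: List[str]  # hypothetical: owning cgroup of each page


def warm_bytes_per_cgroup(regions: List[Region],
                          min_accesses: int) -> Dict[str, int]:
    """Account how many bytes of warm regions belong to each cgroup.

    A region is treated as warm when nr_accesses >= min_accesses.  Nothing
    is done to the pages; the per-cgroup byte counts are the only output.
    """
    stats: Dict[str, int] = {}
    for region in regions:
        if region.nr_accesses < min_accesses:
            continue  # cold region, not part of the working set
        for cgroup in region.page_cgroups:
            stats[cgroup] = stats.get(cgroup, 0) + PAGE_SIZE
    return stats
```

For example, with one warm region holding two pages of cgroup "a" and one page
of cgroup "b", plus one cold region, only the warm region contributes to the
per-cgroup totals.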

> It would be good to have something that will work for different
> backing implementations, DAMON, MGLRU, or active/inactive LRU.

I think we can do this using the filter statistics, with new filter types.  For
example, we can add new DAMOS filters that filter pages by whether they are in
a specific range of MGLRU generations, or whether they belong to the active or
inactive LRU list.

> 
> >
> > > Use cases
> > > ==========
[...]
> > Access frequency is only half the picture. Whether you need to keep
> > memory with a given frequency resident depends on the speed of the
> > backing device.
[...]
> > > Benchmarks
> > > ==========
> > > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > > compile and redis benchmarks from openbenchmarking.org. The policy and
> > > runner is referred to as WMO (Workload Memory Optimization).
> > > The results were based on v3 of the series, but v4 doesn't change the core
> > > of the working set reporting and just adds the ballooning counterpart.
> > >
> > > The timed Linux kernel compilation benchmark shows improvements in peak
> > > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > > every 40 seconds". A swapfile is configured on SSD.
[...]
> > You can do this with a recent (>2018) upstream kernel and ~100 lines
> > of python [1]. It also works on both LRU implementations.
> >
> > [1] https://github.com/facebookincubator/senpai
> >
> > We use this approach in virtually the entire Meta fleet, to offload
> > unneeded memory, estimate available capacity for job scheduling, plan
> > future capacity needs, and provide accurate memory usage feedback to
> > application developers.
> >
> > It works over a wide variety of CPU and storage configurations with no
> > specific tuning.
> >
> > The paper I referenced above provides a detailed breakdown of how it
> > all works together.
> >
> > I would be curious to see a more in-depth comparison to the prior art
> > in this space. At first glance, your proposal seems more complex and
> > less robust/versatile, at least for offloading and capacity gauging.
> We have implemented TMO PSI-based proactive reclaim and compared it to
> a kstaled-based reclaimer (reclaiming based on 2 minute working set
> and refaults). The PSI-based reclaimer was able to save more memory,
> but it also caused spikes of refaults and a lot higher
> decompressions/second. Overall the test workloads had better
> performance with the kstaled-based reclaimer. The conclusion was that
> it was a trade-off.

I agree access frequency is only half of the picture, and there could be a
tradeoff.  Motivated by that prior work, DAMOS provides PSI-based
aggressiveness auto-tuning so that both can be used together.
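
The general idea of tuning reclaim aggressiveness from pressure feedback can be
sketched in Python as below.  This is a generic proportional feedback loop
written for illustration, assuming the caller samples a memory pressure value
(for example from PSI) and passes it in; it is not the actual DAMOS tuning
logic, and `next_quota`, `psi_target` and `min_quota` are hypothetical names.

```python
def next_quota(last_quota: int, psi_current: float, psi_target: float,
               min_quota: int = 4096) -> int:
    """Return the next reclaim quota in bytes from measured pressure.

    When the measured pressure is below the target, the quota grows
    proportionally to the gap; when it is above the target, the quota
    shrinks.  The result never drops below min_quota, so the loop can
    recover after a pressure spike.
    """
    if psi_target <= 0:
        raise ValueError("psi_target must be positive")
    # Relative distance from the target: positive means room to reclaim more.
    error = (psi_target - psi_current) / psi_target
    return max(min_quota, int(last_quota * (1 + error)))
```

Calling this once per tuning interval with the latest pressure sample makes the
quota back off after refault spikes and grow again while pressure stays low.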

> I do agree there's not a good in-depth comparison
> with prior art though.

I would be more than happy to help with the comparison work against DAMON,
both its current implementation and future plans, and with any possible
collaborations.

Thanks,
SJ