RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON

SeongJae Park <sj@xxxxxxxxxx> · Thu, 21 Mar 2024 16:10:19 -0700

On Wed, 20 Mar 2024 12:31:17 +0000 "Prasad, Aravinda" <aravinda.prasad@xxxxxxxxx> wrote:

> 
> 
> > -----Original Message-----
> > From: SeongJae Park <sj@xxxxxxxxxx>
> > Sent: Tuesday, March 19, 2024 10:51 AM
> > To: Prasad, Aravinda <aravinda.prasad@xxxxxxxxx>
> > Cc: damon@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; sj@xxxxxxxxxx; linux-
> > kernel@xxxxxxxxxxxxxxx; s2322819@xxxxxxxx; Kumar, Sandeep4
> > <sandeep4.kumar@xxxxxxxxx>; Huang, Ying <ying.huang@xxxxxxxxx>; Hansen,
> > Dave <dave.hansen@xxxxxxxxx>; Williams, Dan J <dan.j.williams@xxxxxxxxx>;
> > Subramoney, Sreenivas <sreenivas.subramoney@xxxxxxxxx>; Kervinen, Antti
> > <antti.kervinen@xxxxxxxxx>; Kanevskiy, Alexander
> > <alexander.kanevskiy@xxxxxxxxx>
> > Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
> > 
> > Hi Aravinda,
> > 
> > 
> > Thank you for posting this new revision!
> > 
> > I remember I told you that I don't see a high level significant problems on on the
> > reply to the previous revision of this patch[1], but I show a concern now.
> > Sorry for not raising this earlier, but let me explain my humble concerns before
> > being even more late.
> 
> Please find my comments below:
> 
> > 
> > On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad
> > <aravinda.prasad@xxxxxxxxx> wrote:
> > 
> > > DAMON randomly samples one or more pages in every region and tracks
> > > accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> > > When the region size is large (e.g., several GBs), which is common for
> > > large footprint applications, detecting whether the region is accessed
> > > or not completely depends on whether the pages that are actively
> > > accessed in the region are picked during random sampling.
> > > If such pages are not picked for sampling, DAMON fails to identify the
> > > region as accessed. However, increasing the sampling rate or
> > > increasing the number of regions increases CPU overheads of kdamond.
> > 
> > DAMON uses sampling because it considers a region as accessed if a portion of
> > the region that big enough to be detected via sampling is all accessed.  If a region
> > is having some pages that really accessed but the proportion is too small to be
> > found via sampling, I think DAMON could say the overall access to the region is
> > only modest and could even be ignored.  In my humble opinion, this fits with the
> > definition of DAMON region: A memory address range that constructed with
> > pages having similar access frequency.
> 
> Agree that DAMON considers a region as accessed if a good portion of the region
> is accessed. But few points I would like to discuss:
> 
> For large regions (say 10GB, that has 2,621,440 4K pages), sampling at PTE level
> will not cover a good portion of the region. For example, default 5ms sampling
> and 100ms aggregation samples only 20 4K pages in an aggregation interval. 

If the 20 attempts all failed at finding any single accessed 4K page, I think
it roughly means less than 5% of the region is accessed within the
user-specified time (aggregation interval).  I would translate that as only
tiny portion of the region is accessed within the user-specified time, and
hence DAMON is ok to say the region is nearly not accessed.

> Increasing sampling to 1 ms and aggregation to 1 second can only cover 
> 1000 4K pages, but results in higher CPU overheads due to frequent sampling. 
> Even increasing the aggregation interval to 60 seconds but sampling at 5ms can
> only cover 12000 samples, but region splitting and merging happens once
> in 60 seconds.

At the beginning of each sampling interval, DAMON randomly picks one page per
region, clear their accessed bits, wait until the sampling interval is
finished, and check the accessed bits again.  In other words, DAMON shows only
accesses that made in last sampling interval.

Increasing number of samples per aggregation interval can help DAMON knows the
access frequency of regions in finer granularity, but doesn't allow DAMON see
more accesses.  Rather than that, if the aggregation interval is fixed
(reducing sampling interval), DAMON can show even less amount of accesses.

What we need here is giving the workload longer sampling time so that the
workload can make access to a size of memory regions that large enough to be
found by DAMON.

> 
> In addition, this worsens when region sizes are say 100GB+. We observe that
> sampling at PTE level does not help for large regions as more samples are
> are required. So, decreasing/increasing the sampling or aggressions intervals
> proportional to the region size is not practical as all regions are of not equal
> size, we can have 100GB regions as well as many small regions (e.g., 16KB
> to 1MB).

IMO, it becomes worse because the minimum size of accessed memory regions that
can be found by DAMON via sampling has increased together, while you didn't
give more sampling time (a.k.a the time to let the workload make accesses that
DAMON can show).

> So tuning sampling rate and aggregation interval did not help for large
> regions.

Due to the mechanism of the DAMON's sampling I mentioned above, I think this is
what expected.  We need to increase sampling interval.

> 
> It can also be observed that large regions cannot be avoided. Large regions
> are created by merging adjacent smaller regions or at the beginning of
> profiling (depending on min regions parameter which defaults to 10). 
> Increasing min region reduces the size of regions but increases kdamond
> overheads, hence, not preferable.
> 
> So, sampling at PTE level cannot precisely detect accesses to large regions
> resulting in inaccuracies, even though it works for small regions. 
> From our experiments, we found that with 10% hot data in a large region
> (80+GB regions in a 1TB+ footprint application), DAMON was not able to
> detect a single access to that region in 95+% cases with different sample
> and aggregation interval combinations. But DAMON works good for
> applications with footprint <50GB where regions are typically small.
> 
> Now consider the scenario with the proposed enhancement. With a
> 100GB region, if we sample a PUD entry that covers 1GB address
> space, then the default 5ms sampling and 100ms aggregation samples
> 20 PUD entries that is 20 GB portion of the region. This gives a good
> estimate of the portion of the region that is accessed. But the downside
> is that as PUD accessed bit is set even if a small set of pages are accessed
> under its subtree this can report more access as real as you noted.
> 
> But such large regions are split into two in the next aggregation interval. 
> As the splitting of the regions continues, in next few aggregation intervals
> many smaller regions are formed. Once smaller regions are formed,
> the proposed enhancement cannot pick higher levels of the page table
> tree and behaves as good as default DAMON. So, with the proposed
> enhancement, larger regions are quickly split into smaller regions if they
> have only small set of pages accessed.

I fully agree.  This is what could be a real and important benefits.

> 
> To avoid misinterpreting region access count, I feel that the "age" of the
> region is of real help and should be considered by both DAMON and the
> proposed enhancement. If the age of a region is small (<10) then that
> region should not be considered stable and hence should not be
> considered for any memory tiering decisions. For regions with age, 
> say >10, can be considered as stable as they reflect the actual access
> frequency.

I think this is a good approach, but difficult to be used by default.  I think
we might be able to get the benefit without making problem at the
over-reporting accesses by using the high level accessed bit check results as a
hint for better quality of region split?

Also, if we can allow large enough age, the random region split will eventually
find the small hot regions even without high level accessed bit hint.  Of
course the hint could help finding it earlier.  I think that was one of my
comment on the first version of this patch.

> 
> > 
> > >
> > > This patch proposes profiling different levels of the
> > > application\u2019s page table tree to detect whether a region is
> > > accessed or not. This patch set is based on the observation that, when
> > > the accessed bit for a page is set, the accessed bits at the higher
> > > levels of the page table tree (PMD/PUD/PGD) corresponding to the path
> > > of the page table walk are also set. Hence, it is efficient to check
> > > the accessed bits at the higher levels of the page table tree to
> > > detect whether a region is accessed or not. For example, if the access
> > > bit for a PUD entry is set, then one or more pages in the 1GB PUD
> > > subtree is accessed as each PUD entry covers 1GB mapping. Hence,
> > > instead of sampling thousands of 4K/2M pages to detect accesses in a
> > > large region, sampling at the higher level of page table tree is faster and
> > efficient.
> > 
> > Due to the above reason, I concern this could result in making DAMON monitoring
> > results be inaccurately biased to report more than real accesses.
> 
> DAMON, even without the proposed enhancement, can result in inaccuracies
> for large regions, (see examples above).

I think temporarily missing such tiny portion of accesses is not a critical
problem.  If this is a problem, the user should increase the sampling interval
in my opinion.  That said, as mentioned above, DAMON would better to improve
its regions split mechanism.

> 
> > 
> > >
> > > This patch set is based on 6.8-rc5 kernel (commit: f48159f8,
> > > mm-unstable
> > > tree)
> > >
> > > Changes since v1 [1]
> > > ====================
> > >
> > >  - Added support for 5-level page table tree
> > >  - Split the patch to mm infrastructure changes and DAMON enhancements
> > >  - Code changes as per comments on v1
> > >  - Added kerneldoc comments
> > >
> > > [1] https://lkml.org/lkml/2023/12/15/272
> > >
> > > Evaluation:
> > >
> > > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> > >   and 5TB with 10GB hot data.
> > > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> > >   parameters set to default value.
> > > - DAMON+PTP: Page table profiling applied to DAMON with the above
> > >   parameters.
> > >
> > > Profiling efficiency in detecting hot data:
> > >
> > > Footprint	1GB	10GB	100GB	5TB
> > > ---------------------------------------------
> > > DAMON		>90%	<50%	 ~0%	  0%
> > > DAMON+PTP	>90%	>90%	>90%	>90%
> > 
> > Sampling interval is the time interval that assumed to be large enough for the
> > workload to make meaningful amount of accesses within the interval.  Hence,
> > meaningful amount of sampling interval depends on the workload's characteristic
> > and system's memory bandwidth.
> > 
> > Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and 10GB
> > for the four cases, respectively.  And you set the sampling interval as 5ms.  Let's
> > assume the system can access, say, 50 GB per second, and hence it could be able
> > to access only up to 250 MB per 5ms.  So, in case of 1GB and footprint, all hot
> > memory region would be accessed while DAMON is waiting for next sampling
> > interval.  Hence, DAMON would be able to see most accesses via sampling.  But
> > for 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot memory
> > region would be accessed between the sampling interval.  DAMON cannot see
> > whole accesses, and hence the precision could be low.
> > 
> > I don't know exact memory bandwith of the system, but to detect the 10 GB hot
> > region with 5ms sampling interval, the system should be able to access 2GB
> > memory per millisecond, or about 2TB memory per second.  I think systems of
> > such memory bandwidth is not that common.
> > 
> > I show you also explored a configuration setting the aggregation interval higher.
> > But because each sampling checks only access between the sampling interval,
> > that might not help in this setup.  I'm wondering if you also explored increasing
> > sampling interval.
> >
> 
> What we have observed that many real-world benchmarks we experimented
> with do not saturate the memory bandwidth. We also experimented with
> masim microbenchmark to understand the impact on memory access rate
> (we inserted delay between memory access operations in do_rnd_ro() and
> other functions). We see decrease in the precision as access intensity is
> reduced. We have experimented with different sampling and aggregation
> intervals, but that did not help much in improving precision. 

Again, please note that DAMON can show only accesses made between each sampling
interval at a time.  The important factor for expectation of DAMON's accuracy
is, the balance between the memory access intensity of the workload, and the
length of the sampling interval.  The workload should be access intensive
enough to make sufficient amount of accesses between sampling interval.  The
sampling interval should be long enough to allow the workload makes sufficient
amount of accesses within the time interval.

The fact that the workloads were not saturating the memory bandwidth is not
enough to know if that means the workload was memory intensive enough, and the
sampling interval was long enough.

I was mentioning the memory bandwidth as only the maximum memory intensity of
the system that could be achieved.

> 
> So, what I think is it that most of the cases the precision depends on the page
> (hot or cold) that is randomly picked for sampling than the sampling rate. Most
> of the time only cold 4K pages are picked in a large region as they typically
> account for 90% of the pages in the region and hence DAMON does not
> detect any accesses at all. By profiling higher levels of the page table tree
> this can be improved.

Again, agreed.  This is an important and grateful finding.  Thank you.  And
again as mentioned above, I don't think we can merge this patch as is, but we
could think about using the high level access bit check results as a hint to
better split the regions.

Indeed, DAMON's monitoring mechanism has many rooms for improvements.  I also
have some ideas but my time was more spent on more capabilities of DAMON/DAMOS
so far.  It was a bit intentional proiority setting since I got no real DAMON
accuracy problem report from the production usage, and improving the accuracy
will deliver the benefit to all DAMON/DAMOS features.

Since an important milestone of DAMOS, namely auto-tuning, has merged into the
mainline, I think I may better to spend more time on monitoring accuracy
improvement.  I have some immature ideas in my head.  I will try to summarize
and share the ideas in near future.

>  
> > Sorry again for finding this concern not early enough.  But I think we may need to
> > discuss about this first.
> 
> Absolutely no problem. Please let me know your thoughts.

Thank you for patiently walking with me :)

Thanks,
SJ

> 
> Regards,
> Aravinda
> 
> > 
> > [1] https://lkml.kernel.org/r/20231215201159.73845-1-sj@xxxxxxxxxx
> > 
> > 
> > Thanks,
> > SJ
> > 
> > 
> > >
> > > CPU overheads (in billion cycles) for kdamond:
> > >
> > > Footprint	1GB	10GB	100GB	5TB
> > > ---------------------------------------------
> > > DAMON		1.15	19.53	3.52	9.55
> > > DAMON+PTP	0.83	 3.20	1.27	2.55
> > >
> > > A detailed explanation and evaluation can be found in the arXiv paper:
> > > https://arxiv.org/pdf/2311.10275.pdf
> > >
> > >
> > > Aravinda Prasad (3):
> > >   mm/damon: mm infrastructure support
> > >   mm/damon: profiling enhancement
> > >   mm/damon: documentation updates
> > >
> > >  Documentation/mm/damon/design.rst |  42 ++++++
> > >  arch/x86/include/asm/pgtable.h    |  20 +++
> > >  arch/x86/mm/pgtable.c             |  28 +++-
> > >  include/linux/mmu_notifier.h      |  36 +++++
> > >  include/linux/pgtable.h           |  79 ++++++++++
> > >  mm/damon/vaddr.c                  | 233 ++++++++++++++++++++++++++++--
> > >  6 files changed, 424 insertions(+), 14 deletions(-)
> > >
> > > --
> > > 2.21.3