RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON

SeongJae Park <sj@xxxxxxxxxx> · Fri, 22 Mar 2024 11:32:36 -0700

On Fri, 22 Mar 2024 12:12:09 +0000 "Prasad, Aravinda" <aravinda.prasad@xxxxxxxxx> wrote:

[...] 
> > > For large regions (say 10GB, that has 2,621,440 4K pages), sampling at
> > > PTE level will not cover a good portion of the region. For example,
> > > default 5ms sampling and 100ms aggregation samples only 20 4K pages in an
> > > aggregation interval.
> > 
> > If the 20 attempts all failed at finding any single accessed 4K page, I think it
> > roughly means less than 5% of the region is accessed within the user-specified
> > time (aggregation interval).  I would translate that as only tiny portion of the

I now find the above sentence is not correct.  Sorry, my bad.  Let me re-write.

I think it roughly means the workload is not accessing the region at a
frequency that is high enough for DAMON to observe within the user-specified
time (sampling interval).

> > region is accessed within the user-specified time, and hence DAMON is ok to say
> > the region is nearly not accessed.
> 
> I am looking at it from the other way:
> 
> To detect if a region is hot or cold at least 1% of the pages in the region should
> be sampled. For a 10GB region (with 2,621,440 4K pages) this requires sampling
> at least 26,214 pages. For a 100GB region this will require sampling at least
> 262,144 pages.

Why do you think 1% of the pages should be sampled?

DAMON defines a region as an address range that contains pages having similar
access frequency.  Hence if we see that a page was accessed within a given time
interval, we can assume all pages in the region were also accessed within the
interval, and vice versa.  That's why we sample only a single page per region,
and how DAMON's monitoring overhead can be controlled regardless of the size of
the monitoring target memory.

To detect if the region is hot or cold, DAMON repeats the sampling multiple
times and uses the number of sampling intervals in which an access to the
region was seen (nr_accesses) as the relative hotness of the region.  Hence,
how many samples are required depends on what precision of the relative hotness
the user wants.  The size of the region doesn't matter here.
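
A minimal Python sketch of the counting described above, not DAMON's actual
code.  It assumes a hypothetical accessed(page) callback that stands in for
"clear the accessed bit, wait one sampling interval, check the bit again", and
the function name count_nr_accesses is made up for this illustration.  The
number of checks depends only on the two intervals, not on the region size.

```python
import random

def count_nr_accesses(region_start, region_end, sampling_us, aggr_us,
                      accessed, rng=random):
    """Return (nr_accesses, nr_samples) for one aggregation interval.

    accessed(page) is a hypothetical stand-in for clearing the accessed
    bit, waiting one sampling interval, and checking the bit again.
    """
    nr_samples = aggr_us // sampling_us
    nr_accesses = 0
    for _ in range(nr_samples):
        # One randomly picked page per region per sampling interval.
        page = rng.randrange(region_start, region_end)
        if accessed(page):
            nr_accesses += 1
    return nr_accesses, nr_samples

# Intervals discussed in this thread: 5 ms sampling, 100 ms aggregation.
# A 1,000-page region and a 2,621,440-page (10GB) region cost the same
# number of checks per aggregation interval.
small = count_nr_accesses(0, 1_000, 5_000, 100_000, lambda page: True)
large = count_nr_accesses(0, 2_621_440, 5_000, 100_000, lambda page: True)
```

With these intervals both calls perform 20 checks, so nr_accesses ranges from 0
to 20 for either region; a finer hotness scale needs more sampling intervals
per aggregation interval, not more pages per region.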

Am I missing something?

> 
> If we sample at 5ms, this takes 131.072 seconds to cover 1% of 10GB and 1310.72
> seconds to cover 100GB. 
> 
> DAMON shows that the selected page as accessed if that page was accessed
> during the 5ms sampling window. Now if we increase the sampling to 20ms to
> improve access detection, then covering 1% of the region takes even longer.
> 
> > 
> > > Increasing sampling to 1 ms and aggregation to 1 second can only cover
> > > 1000 4K pages, but results in higher CPU overheads due to frequent sampling.
> > > Even increasing the aggregation interval to 60 seconds but sampling at
> > > 5ms can only cover 12000 samples, but region splitting and merging
> > > happens once in 60 seconds.
> > 
> > At the beginning of each sampling interval, DAMON randomly picks one page per
> > region, clear their accessed bits, wait until the sampling interval is finished, and
> > check the accessed bits again.  In other words, DAMON shows only accesses that
> > made in last sampling interval.
> 
> Yes, I see this in the code:
> 
> while(time < aggregation_interval)
> {
>   clear_access_bit
>   sleep(sampling_time)
>   check_access_bit
> }
> 
> I would suggest this logic instead.
> 
> while(time < aggregation_interval)
> {
>   number_of_samples = aggregation_interval / sampling_time;
> 
>   for (i = 0; i < number_of_samples; i++) 
>   {
>     clear_access_bit
>   } 
> 
>   sleep(aggregation_time)
> 
>   for (i = 0; i < number_of_samples; i++) 
>   {
>     check_access_bit
>   }
> }
> 
> This can help in better access detection. I am sure you would
> have already explored it.   

The way to detect accesses to a region is implemented by each monitoring
operations set (vaddr, fvaddr, and paddr).  We could implement yet another
monitoring operations set with a new access detection method.  Nonetheless, I
think changing the existing monitoring operations sets to use this suggestion
while keeping their concepts would not be easy.

> 
> > 
> > Increasing number of samples per aggregation interval can help DAMON knows
> > the access frequency of regions in finer granularity, but doesn't allow DAMON see
> > more accesses.  Rather than that, if the aggregation interval is fixed (reducing
> > sampling interval), DAMON can show even less amount of accesses.
> > 
> > What we need here is giving the workload longer sampling time so that the
> > workload can make access to a size of memory regions that large enough to be
> > found by DAMON.
> 
> But even with longer sampling time, we may miss the access. For example, 
> consider all the pages in the region are accessed sequentially. Now if DAMON samples
> a different page other than the page that is being accessed it will miss. Now even if we
> have longer sampling time it is possible that none of the accesses are detected.

If there were accesses to some pages of the region but an unaccessed page was
picked as the sampling target, one could say only a tiny portion of the
region is accessed, so we can treat the region as not accessed at all.  That's
at least what the monitoring operations set you use here ('vaddr') thinks.

[...]
> > Also, if we can allow large enough age, the random region split will eventually find
> > the small hot regions even without high level accessed bit hint.  Of course the hint
> > could help finding it earlier.  I think that was one of my comment on the first
> > version of this patch.
> 
> The problem is that a large region that is split is immediately merged as the split
> regions have access count zero.
> 
> We observe that large regions are never getting split at all due to this.

I understand this is a valid concern.  Especially because we currently split
each region into two sub-regions, finding a small hot memory region in the
middle of a huge region could be challenging.  This concern was raised by
Jonathan Cameron before DAMON was merged into the mainline.  There was also
research from a previous colleague of mine saying that increasing the number of
sub-regions for each split improves the accuracy.  Nonetheless, it increases
the overall number of regions and hence the overhead.  And we haven't gotten a
real issue report due to this from production so far, so we are still keeping
the old behavior.  I'm thinking about a way to make this better.

That said, a real system would have more than a single region, and the
access pattern would be more dynamic.  That will cause the regions to be merged
and split in a more random and chaotic way.  Hence I think there is still a
chance to find the small hot portion eventually.  Also, the sampling targets
are picked randomly.  A page of the small hot portion will eventually be picked
as the sampling target, even multiple times, and at least reset the 'age' of
the region.
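
As a rough back-of-the-envelope sketch of "eventually": assume the sampling
target is picked uniformly at random and independently in each sampling
interval, and that the hot pages are accessed in every sampling interval (both
are simplifying assumptions, and the function name below is made up for this
illustration).  Then the chance that at least one of n samples lands in a hot
portion covering fraction p of the region is 1 - (1 - p)^n.

```python
def prob_hit_at_least_once(hot_fraction, nr_samples):
    """Probability that at least one of nr_samples uniform picks hits the hot part."""
    return 1.0 - (1.0 - hot_fraction) ** nr_samples

# 20 samples per aggregation interval (5 ms sampling, 100 ms aggregation),
# with a hot portion covering 0.1% of the region.
per_aggregation = prob_hit_at_least_once(0.001, 20)

# The same region observed for 60 seconds at 5 ms sampling: 12000 samples.
per_minute = prob_hit_at_least_once(0.001, 12000)
```

Under these assumptions a single aggregation interval rarely hits the hot
portion, while a minute of monitoring almost certainly does at least once.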

I sometimes turn on DAMON to monitor the entire physical address space (about
128 GiB) of my machine and run no active workload but just a few background
daemons.  So the system would have only a small amount of accesses.  At the
beginning, the monitoring output shows all regions as not accessed (nr_accesses
0) and having the same 'age'.  But as time goes by, the regions still show
no access (nr_accesses 0), but have different ages and sizes.

Again, I'm not saying the existing monitoring mechanism is perfect and optimal.
We should continue optimizing it.  Nonetheless, the current accuracy has not
been proven to be too awful to be used in the real world.  I know of at least a
few unnamed production usages of DAMON, and they haven't complained about
DAMON's accuracy so far.

Thanks,
SJ

> 
> Regards,
> Aravinda
[...]