Re: [LSF/MM/BPF TOPIC] Overhauling hot page detection and promotion based on PTE A bit scanning

SeongJae Park <sj@xxxxxxxxxx> · Mon, 27 Jan 2025 10:34:03 -0800

On Mon, 27 Jan 2025 10:41:07 +0530 Bharata B Rao <bharata@xxxxxxx> wrote:

> On 26-Jan-25 7:57 AM, Huang, Ying wrote:
> > Hi, Raghavendra,
> > 
> > Raghavendra K T <raghavendra.kt@xxxxxxx> writes:
> > 
> >> Bharata and I would like to propose the following topic for LSFMM.
> >>
> >> Topic: Overhauling hot page detection and promotion based on PTE A bit scanning.
> >>   
> >> In the Linux kernel, hot page information can potentially be obtained from
> >> multiple sources:
> >>   
> >> a. PROT_NONE faults (NUMA balancing)
> >> b. PTE Access bit (LRU scanning)
> >> c. Hardware provided page hotness info (like AMD IBS)
> >>   
> >> This information is further used to migrate (or promote) pages from slow memory
> >> tier to top tier to increase performance.
> >>
> >> In the current hot page promotion mechanism, all the activities including the
> >> process address space scanning, NUMA hint fault handling and page migration are
> > >> performed in process context, i.e., the scanning overhead is borne by the
> >> applications.
> >>   
> >> I had recently posted a patch [1] to improve this in the context of slow-tier
> > >> page promotion. Here, scanning is done by a global kernel thread which routinely
> >> scans all the processes' address spaces and checks for accesses by reading the
> > >> PTE A bit. The hot pages thus identified are maintained in a list and subsequently
> > >> are promoted to a default top-tier node. Thus, the approach moves the overhead of
> > >> scanning, NUMA hint faults and migrations out of process context.
> > 
> > This has been discussed before too.  For example, in the following thread
> > 
> > https://lore.kernel.org/all/20200417100633.GU20730@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/
> 
> Thanks for pointing to this discussion.
> 
> > 
> > The drawbacks of asynchronous scanning include
> > 
> > - The CPU cycles used are not charged properly
> > 
> > - There may be no idle CPU cycles to use
> > 
> > - The scanning CPU may not be near enough to the workload CPUs
> > 
> > It's better to involve Mel and Peter in the discussion for this.
> 
> They are CC'ed in this thread and hopefully have insights to share.
> 
> Charging CPU cycles to the right process has been brought up in other 
> similar contexts. A recent example is page migration batching and using 
> multiple threads for migration - 
> https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@xxxxxxxxxxxxxx/
> 
> Does it make sense to treat hot page promotion from slow tiers 
> differently compared to locality based balancing? I mean couldn't the 
> charging of this async thread be similar to the cycles spent by other 
> system threads like kcompactd and khugepaged?

I'm up for this idea.

I agree that fairness is something we need to be aware of.  But IMHO, it is
something the async approach can be further improved on later, not a strict
blocker for now.

> 
> > 
> >> The topic was presented in the MM alignment session hosted by David Rientjes [2].
> >> The topic also finds a mention in S J Park's LSFMM proposal [3].
> >>   
> >> Here is the list of potential discussion points:
> >> 1. Other improvements and enhancements to PTE A bit scanning approach. Use of
> >> multiple kernel threads, throttling improvements, promotion policies, per-process
> >> opt-in via prctl, virtual vs physical address based scanning, tuning hot page
> >> detection algorithm etc.
> > 
> > One drawback of physical address based scanning is that it's hard to
> > apply some workload specific policy.  For example, suppose a low priority
> > workload has many relatively hot pages, while a high priority workload
> > has many relatively warm (not so hot) pages.  We need to promote the warm
> > pages in the high priority workload, while physical address based
> > scanning may report the hot pages in the low priority workload.  Right?
> 
> Correct. I wonder if DAMON has already devised a scheme to address this. SJ?

Yes, I think DAMOS quotas and DAMOS filters can be used to address this issue.

For this case, assuming each workload has its own cgroup, users can add a DAMOS
scheme for promotion per workload.  The schemes will have different DAMOS
quotas based on the workloads' priority.  The schemes will also be controlled
to do the promotion for pages of the specific workloads using DAMOS filters.

For example, the kdamond configuration below can be used.

# damo args damon \
	--damos_action migrate_hot 0 --damos_quotas 100ms 1G 1s 0% 100% 100% \
	--damos_filter reject none memcg /workloads/high-priority \
	\
	--damos_action migrate_hot 0 --damos_quotas 10ms 100M 1s 0% 100% 100% \
	--damos_filter reject none memcg /workloads/low-priority \
	--damos_nr_filters 1 1 --out kdamond.json
# damo report damon --input_file ./kdamond.json --damon_params_omit_defaults
kdamond 0
    context 0
        ops: paddr
        target 0
            region [4,294,967,296, 68,577,918,975) (59.868 GiB)
        intervals: sample 5 ms, aggr 100 ms, update 1 s
        nr_regions: [10, 1,000]
        scheme 0
            action: migrate_hot to node 0 per aggr interval
            target access pattern
                sz: [0 B, max]
                nr_accesses: [0 %, 18,446,744,073,709,551,616 %]
                age: [0 ns, max]
            quotas
                100 ms / 1024.000 MiB / 0 B per 1 s
                priority: sz 0 %, nr_accesses 100 %, age 100 %
            filter 0
                reject none memcg /workloads/high-priority
        scheme 1
            action: migrate_hot to node 0 per aggr interval
            target access pattern
                sz: [0 B, max]
                nr_accesses: [0 %, 18,446,744,073,709,551,616 %]
                age: [0 ns, max]
            quotas
                10 ms / 100.000 MiB / 0 B per 1 s
                priority: sz 0 %, nr_accesses 100 %, age 100 %
            filter 0
                reject none memcg /workloads/low-priority

Please note that this is just one example based on existing DAMOS features.
It may have drawbacks, and future optimizations are possible.
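
To illustrate the intent of the two schemes, here is a toy Python model.  It
is only a sketch and not DAMON code: the Region and Scheme types and the
pick_promotions() helper are hypothetical names made up for this example, the
time quota is ignored, and regions are never split.  It assumes each scheme
sees only the regions of its own memcg (the role of the DAMOS filter) and
promotes the hottest ones until its size quota for the interval is used up.

```python
from dataclasses import dataclass

MiB = 1 << 20


@dataclass(frozen=True)
class Region:
    memcg: str        # owning cgroup path (hypothetical field)
    size: int         # region size in bytes
    nr_accesses: int  # hotness score, higher means hotter


@dataclass(frozen=True)
class Scheme:
    memcg: str        # only regions of this memcg pass the filter
    size_quota: int   # bytes that may be promoted per quota interval


def pick_promotions(regions, scheme):
    """Return the regions one scheme would promote in one quota interval."""
    # Filter step: keep only regions that belong to the scheme's memcg.
    candidates = [r for r in regions if r.memcg == scheme.memcg]
    # Prioritization step: hottest regions first.
    candidates.sort(key=lambda r: r.nr_accesses, reverse=True)
    picked, used = [], 0
    for region in candidates:
        # Stop at the first region that no longer fits in the quota.
        if used + region.size > scheme.size_quota:
            break
        picked.append(region)
        used += region.size
    return picked


regions = [
    Region("/workloads/low-priority", 64 * MiB, 90),
    Region("/workloads/low-priority", 64 * MiB, 80),
    Region("/workloads/high-priority", 512 * MiB, 40),
    Region("/workloads/high-priority", 512 * MiB, 30),
    Region("/workloads/high-priority", 512 * MiB, 20),
]
schemes = [
    Scheme("/workloads/high-priority", 1024 * MiB),
    Scheme("/workloads/low-priority", 100 * MiB),
]

promoted = {s.memcg: pick_promotions(regions, s) for s in schemes}
promoted_bytes = {m: sum(r.size for r in rs) for m, rs in promoted.items()}
for memcg, nbytes in promoted_bytes.items():
    print(memcg, nbytes // MiB, "MiB")
```

In this model the warm pages of the high priority workload get most of the
promotion budget even though the low priority workload's pages are hotter,
because each workload competes only within its own quota.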

Thanks,
SJ

> 
> Regards,
> Bharata.