Re: [RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit

On 12/11/2024 12:23 AM, SeongJae Park wrote:
Hello Raghavendra,


Thank you for posting this nice patch series.  I gave you some feedback
offline.  I am repeating it here for transparency in this public
discussion.

On Sun, 1 Dec 2024 15:38:08 +0000 Raghavendra K T <raghavendra.kt@xxxxxxx> wrote:

Introduction:
=============
This patchset is an outcome of an ongoing collaboration between AMD and Meta.
Meta wanted to explore an alternative page promotion technique as they
observe high latency spikes in their workloads that access CXL memory.

In the current hot page promotion, all the activities, including
process address space scanning, NUMA hint fault handling and page
migration, are performed in process context; i.e., the scanning overhead
is borne by applications.

Yet another approach is using DAMON.  DAMON does access monitoring, and further
allows users to request access pattern-driven system operations in the name of
DAMOS (Data Access Monitoring-based Operation Schemes).  Using it, users can
request DAMON to find hot pages and promote them, while finding cold pages and
demoting them.  SK hynix has built their CXL-based memory capacity expansion
solution in this way
(https://github.com/skhynix/hmsdk/wiki/Capacity-Expansion).  We
collaboratively developed new DAMON features for that, and those are all
in the mainline since Linux v6.11.

I also proposed an idea for advancing it using DAMOS auto-tuning on a more
general (>2 tiers) setup
(https://lore.kernel.org/20231112195602.61525-1-sj@xxxxxxxxxx).  I haven't had
time to further implement and test the idea so far, though.


This is an early RFC patch series to do (slow tier) CXL page promotion.
The approach in this patchset assists/addresses the issue by adding PTE
Accessed bit scanning.

Scanning is done by a global kernel thread which routinely scans all
the processes' address spaces and checks for accesses by reading the
PTE A bit. It then migrates/promotes the pages to the toptier node
(node 0 in the current approach).

Thus, the approach moves the overhead of scanning, NUMA hint faults and
migrations out of process context.
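
To make the scan-and-promote loop described above concrete, here is a
minimal userspace sketch of a single scan pass.  It is a toy model, not
code from the patch series: struct toy_pte, TOY_TOPTIER_NID and
toy_scan_pass() are hypothetical names invented for illustration, and a
real implementation would walk page tables and clear the A bit with the
kernel's own helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a PTE: only the fields this sketch needs (hypothetical). */
struct toy_pte {
	unsigned long pfn;	/* page frame number */
	int nid;		/* NUMA node currently backing the page */
	bool accessed;		/* models the hardware-set PTE A bit */
};

#define TOY_TOPTIER_NID 0	/* the cover letter promotes to node 0 */

/*
 * One scan pass over a toy "address space".
 *
 * Pages on a non-toptier node whose A bit is set are recorded in
 * promote_list[], up to list_cap entries; candidates beyond that are
 * dropped.  The A bit of every scanned entry is cleared so that the
 * next pass only sees accesses made after this one.
 *
 * Returns the number of promotion candidates recorded.
 */
size_t toy_scan_pass(struct toy_pte *ptes, size_t nr_ptes,
		     unsigned long *promote_list, size_t list_cap)
{
	size_t nr_candidates = 0;

	assert(ptes != NULL && promote_list != NULL);

	for (size_t i = 0; i < nr_ptes; i++) {
		bool was_accessed = ptes[i].accessed;

		ptes[i].accessed = false;	/* test-and-clear */
		if (!was_accessed || ptes[i].nid == TOY_TOPTIER_NID)
			continue;
		if (nr_candidates < list_cap)
			promote_list[nr_candidates++] = ptes[i].pfn;
	}
	return nr_candidates;
}
```

Because the pass runs over the entries from outside the accessing task,
none of this work is charged to the application, which is the point the
paragraph above makes.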

DAMON also uses the PTE A bit as a major source of access information.  And
DAMON does both access monitoring and promotion/demotion in a global kernel
thread, namely kdamond.  Hence the DAMON-based approach would also offload the
overheads from process context.  So I feel your approach is somewhat similar
to the DAMON-based one, and we might have a chance to avoid unnecessary
duplication.

[...]

Limitations:
===========
PTE A bit scanning approach lacks information about exact destination
node to migrate to.

This is the same for the DAMON-based approach, since DAMON also uses the PTE A
bit as the major source of the information.  We aim to extend DAMON to be aware
of the access source CPU, and use it for solving this problem, though.
Utilizing page faults or AMD IBS-like h/w features is among the ideas on the
table.


Notes/Observations on design/Implementations/Alternatives/TODOs...
================================
1. Fine-tuning scan throttling

DAMON allows users to set the upper limit of monitoring overhead, using the
max_nr_regions parameter.  It then provides best-effort accuracy under that
limit.  We also have ongoing projects for making it more accurate and easier
to tune.


2. Use migrate_balanced_pgdat() to balance toptier node before migration
  OR Use migrate_misplaced_folio_prepare() directly.
  But it may need some optimizations (e.g., invoke it occasionally so
that the overhead is not incurred on every migration).

3. Explore if a separate PAGE_EXT flag is needed instead of reusing the
PAGE_IDLE flag (cons: complicates PTE A bit handling in the system).
But in practice it does not look like a good idea.

4. Use timestamp information-based migration (similar to numab mode=2)
instead of migrating immediately when the PTE A bit is set.
(cons:
  - It will not be accurate since it is done outside of process
context.
  - Performance benefit may be lost.)

DAMON provides a sort of time-based aggregated monitoring results.  And DAMOS
provides prioritization of pages based on the access temperature.  Hence, the
DAMON-based approach can also be used for a similar purpose (promoting not
every accessed page but pages that are used more frequently over a longer time).


5. Explore if we need to use PFN information + hash list instead of
simple migration list. Here scanning is directly done with PFN belonging
to CXL node.

DAMON supports physical address space monitoring, and maintains the access
monitoring results in its own data structure called damon_region.  So I think
a similar benefit can be achieved using DAMON?

[...]
8. Using DAMON APIs OR Reusing part of DAMON which already tracks range of
physical addresses accessed.

My biased humble opinion is that it would be very nice to explore this
opportunity, since I see some similarities and opportunities to solve some of
the challenges of your approach in an easier way.  Even if it turns out that
DAMON cannot be used for your use case, failing earlier is a good thing, I'd
say :)


9. Gregory has nicely mentioned some details/ideas on different approaches in
[1] : development notes, in the context of promoting unmapped page cache folios.

DAMON supports monitoring accesses to unmapped page cache folios, so hopefully
DAMON-based approaches can also solve this issue.


Hello SJ,

Thank you again for the detailed explanation. (Sorry for the late
acknowledgement; I was looking forward to the MM alignment discussion when
this message came.)

I think once the direction is fixed, we could surely reuse a lot of
source code from DAMON and MGLRU. The amazing design of DAMON should surely
help. I will keep in mind all the points raised here.

Thanks and Regards
- Raghu



