Re: [RFC PATCH 2/4] mm: kpromoted: Hot page info collection and promotion daemon

On 14-Mar-25 2:06 AM, Davidlohr Bueso wrote:
On Thu, 06 Mar 2025, Bharata B Rao wrote:

+/*
+ * Go through the page hotness information and migrate pages if required.
+ *
+ * Promoted pages are no longer tracked in the hot list.
+ * Cold pages are pruned from the list as well.
+ *
+ * TODO: Batching could be done
+ */
+static void kpromoted_migrate(pg_data_t *pgdat)
+{
+    int nid = pgdat->node_id;
+    struct page_hotness_info *phi;
+    struct hlist_node *tmp;
+    int nr_bkts = HASH_SIZE(page_hotness_hash);
+    int bkt;
+
+    for (bkt = 0; bkt < nr_bkts; bkt++) {
+        mutex_lock(&page_hotness_lock[bkt]);
+        hlist_for_each_entry_safe(phi, tmp, &page_hotness_hash[bkt], hnode) {
+            if (phi->hot_node != nid)
+                continue;
+
+            if (page_should_be_promoted(phi)) {
+                count_vm_event(KPROMOTED_MIG_CANDIDATE);
+                if (!kpromote_page(phi)) {
+                    count_vm_event(KPROMOTED_MIG_PROMOTED);
+                    hlist_del_init(&phi->hnode);
+                    kfree(phi);
+                }
+            } else {
+                /*
+                 * Not a suitable page or cold page, stop tracking it.
+                 * TODO: Identify cold pages and drive demotion?
+                 */

I don't think kpromoted should drive demotion at all. No one is complaining about migrate in lieu of discard, and there is also proactive reclaim which users can trigger. All the in-kernel problems are wrt promotion. The simpler any of these kthreads are the better.
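For reference, the user-triggered proactive reclaim mentioned above is available through the cgroup v2 memory.reclaim interface; a minimal example (the cgroup path is illustrative, substitute the group of interest):

```shell
# Ask the kernel to attempt to reclaim 1G from this cgroup (cgroup v2).
# On tiered-memory systems this reclaim can drive demotion to lower tiers.
echo "1G" > /sys/fs/cgroup/example-group/memory.reclaim
```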

I was testing on the default kernel with NUMA balancing mode 2.

The multi-threaded application allocates memory on DRAM and the allocation spills over to CXL node. The threads keep accessing allocated memory pages in random order.

pgpromote_success 6
pgpromote_candidate 745387
pgdemote_kswapd 51085
pgdemote_direct 10481
pgdemote_khugepaged 0
numa_pte_updates 27249625
numa_huge_pte_updates 0
numa_hint_faults 9660745
numa_hint_faults_local 0
numa_pages_migrated 6
numa_node_full 745438
pgmigrate_success 2225458
pgmigrate_fail 1187349

I hardly see any promotion happening.

In order to check the number of times the top-tier node was found to be full when attempting to promote, I added a numa_node_full counter like below:

diff --git a/mm/migrate.c b/mm/migrate.c
index fb19a18892c8..4d049d896589 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2673,6 +2673,7 @@ int migrate_misplaced_folio_prepare(struct folio *folio,
        if (!migrate_balanced_pgdat(pgdat, nr_pages)) {
                int z;

+               count_vm_event(NUMA_NODE_FULL);
                if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING))
                        return -EAGAIN;
                for (z = pgdat->nr_zones - 1; z >= 0; z--) {


As seen above, numa_node_full is 745438, which matches the pgpromote_candidate count.

I do see counters reporting kswapd-driven and direct demotion as well, but does this mean that demotion isn't happening fast enough to cope with the promotion demand in this situation of high top-tier memory pressure?

Regards,
Bharata.



