Hi Huang,

Have you had a chance to look at our hot page detection patch that
Hasan sent out some time ago? [1]

It hooks into page reclaim to determine what is and isn't hot. Reclaim
is an existing, well-tested mechanism to do just that. It's just 13
lines of code: set the active bit on the first hint fault; promote on
the second one if the active bit is still set. This promotes only pages
hot enough that they can compete with toptier access frequencies.

It's not just convenient, it's also essential to link tier promotion
rate to page aging. Tiered NUMA balancing is about establishing a
global LRU order across two (or more) nodes. LRU promotions *within* a
node require multiple LRU cycles with references. LRU promotions
*between* nodes must follow the same rules, and be subject to the same
aging pressure, or you can get much colder pages promoted into a very
hot workingset and wreak havoc.

We've hammered this patch quite extensively with several Meta
production workloads and it's been working reliably at keeping
reasonable promotion rates.

@@ -4202,6 +4202,19 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	last_cpupid = page_cpupid_last(page);
 	page_nid = page_to_nid(page);
+
+	/* Only migrate pages that are active on non-toptier node */
+	if (numa_promotion_tiered_enabled &&
+	    !node_is_toptier(page_nid) &&
+	    !PageActive(page)) {
+		count_vm_numa_event(NUMA_HINT_FAULTS);
+		if (page_nid == numa_node_id())
+			count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+		mark_page_accessed(page);
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		goto out;
+	}
+
 	target_nid = numa_migrate_prep(page, vma, vmf->address, page_nid, &flags);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);

[1] https://lore.kernel.org/all/20211130003634.35468-1-hasanalmaruf@xxxxxx/t/#m85b95624622f175ca17a00cc8cc0fc9cc4eeb6d2
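
P.S. To make the two-touch rule concrete, here is a rough userspace model of the behavior described above. It is only a sketch, not kernel code: every name in it (model_page, model_hint_fault, model_age, the node ids) is made up for illustration, and it collapses mark_page_accessed() into a single flag set, following the "active bit on the first hint fault" description rather than the full LRU state machine.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; none of these names are kernel APIs. */
enum { TOPTIER_NID = 0, SLOWTIER_NID = 1 };

struct model_page {
	int  nid;	/* node the page currently lives on */
	bool active;	/* stands in for PageActive() */
};

static bool model_node_is_toptier(int nid)
{
	return nid == TOPTIER_NID;
}

/* Reclaim aging: a page that was not re-referenced in time is deactivated. */
static void model_age(struct model_page *p)
{
	p->active = false;
}

/* Handle one NUMA hint fault. Returns true if the page was promoted. */
static bool model_hint_fault(struct model_page *p)
{
	if (model_node_is_toptier(p->nid))
		return false;		/* already on the top tier */

	if (!p->active) {
		/* First touch: record the reference, do not migrate. */
		p->active = true;	/* assumption: one-step activation */
		return false;
	}

	/* Second touch while still active: promote. */
	p->nid = TOPTIER_NID;
	return true;
}

/* Walks through both outcomes; returns 0 if the model behaves as described. */
static int model_demo(void)
{
	struct model_page hot  = { SLOWTIER_NID, false };
	struct model_page cold = { SLOWTIER_NID, false };

	assert(!model_hint_fault(&hot));	/* first fault: mark only */
	assert(model_hint_fault(&hot));		/* second fault: promoted */

	assert(!model_hint_fault(&cold));	/* first fault: mark only */
	model_age(&cold);			/* aged out before next fault */
	assert(!model_hint_fault(&cold));	/* starts over, no promotion */

	return 0;
}
```

The point of the cold case is that aging pressure sits between the two faults: a page that reclaim deactivates in the meantime has to earn its active bit again before it can move up.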