Byungchul Park <byungchul@xxxxxx> writes:

> Implementation of CONFIG_MIGRC, which stands for 'Migration Read Copy'.
>
> We always face migration overhead at either promotion or demotion while
> working with tiered memory, e.g. CXL memory, and found out that TLB
> shootdown is a quite big cost that needs to be eliminated if possible.
>
> Fortunately, the TLB flush can be deferred or even skipped if both the
> source and destination folios of a migration are kept until all the
> required TLB flushes have been done, but of course only if the target
> PTE entries have read-only permission, more precisely speaking, don't
> have write permission.  Otherwise, no doubt the folio might get messed
> up.
>
> To achieve that:
>
>    1. For the folios that have only non-writable TLB entries, prevent
>       the TLB flush by keeping both the source and destination folios
>       during migration, which will be handled later at a better time.
>
>    2. When any non-writable TLB entry changes to writable, e.g. through
>       the fault handler, give up the CONFIG_MIGRC mechanism and perform
>       the required TLB flush right away.
>
>    3. TLB flushes can be skipped if all the TLB flushes required to free
>       the duplicated folios have already been done for any reason, not
>       necessarily from migrations.
>
>    4. Adjust the watermark check routine, __zone_watermark_ok(), by the
>       number of duplicated folios, because those folios can be freed
>       and obtained right away through the appropriate TLB flushes.
>
>    5. Perform TLB flushes and free the duplicated folios pending those
>       flushes if the page allocation routine is in trouble due to memory
>       pressure, even more aggressively for high-order allocations.

Is the optimization restricted to page migration only?  Can it be used
in other places too, such as page reclaiming?

> The measurement result:
>
>    Architecture - x86_64
>    QEMU - kvm enabled, host cpu, 2nodes((4cpus, 2GB)+(cpuless, 6GB))
>    Linux Kernel - v6.4, numa balancing tiering on, demotion enabled
>    Benchmark - XSBench with no parameter changed
>
>    run 'perf stat' using events:
>    (FYI, process wide result ~= system wide result(-a option))
>       1) itlb.itlb_flush
>       2) tlb_flush.dtlb_thread
>       3) tlb_flush.stlb_any
>
>    run 'cat /proc/vmstat' and pick up:
>       1) pgdemote_kswapd
>       2) numa_pages_migrated
>       3) pgmigrate_success
>       4) nr_tlb_remote_flush
>       5) nr_tlb_remote_flush_received
>       6) nr_tlb_local_flush_all
>       7) nr_tlb_local_flush_one
>
> BEFORE - mainline v6.4
> ==========================================
>
> $ perf stat -e itlb.itlb_flush,tlb_flush.dtlb_thread,tlb_flush.stlb_any ./XSBench
>
> Performance counter stats for './XSBench':
>
>    426856       itlb.itlb_flush
>    6900414      tlb_flush.dtlb_thread
>    7303137      tlb_flush.stlb_any
>
>    33.500486566 seconds time elapsed
>    92.852128000 seconds user
>    10.526718000 seconds sys
>
> $ cat /proc/vmstat
>
> ...
> pgdemote_kswapd 1052596
> numa_pages_migrated 1052359
> pgmigrate_success 2161846
> nr_tlb_remote_flush 72370
> nr_tlb_remote_flush_received 213711
> nr_tlb_local_flush_all 3385
> nr_tlb_local_flush_one 198679
> ...
>
> AFTER - mainline v6.4 + CONFIG_MIGRC
> ==========================================
>
> $ perf stat -e itlb.itlb_flush,tlb_flush.dtlb_thread,tlb_flush.stlb_any ./XSBench
>
> Performance counter stats for './XSBench':
>
>    179537       itlb.itlb_flush
>    6131135      tlb_flush.dtlb_thread
>    6920979      tlb_flush.stlb_any

It appears that the number of "itlb.itlb_flush" changes a lot, but not
the other 2 events.  Is that because the text segment of the executable
file is mapped read-only, while most other pages are mapped read-write?
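If I read points 1 and 2 of the description right, the deferral is only
possible for mappings that are not writable, which would line up with that
hypothesis.  A minimal sketch of the rule as I understand it (illustrative
only, made-up names, not the patch's actual code):

	/*
	 * Illustrative sketch, not part of the patch: defer the TLB flush
	 * only while every PTE that mapped the folio is non-writable.
	 */
	static bool migrc_may_defer_flush(bool any_pte_writable)
	{
		/*
		 * A writable mapping could be modified through a stale TLB
		 * entry, so its flush cannot be deferred.
		 */
		if (any_pte_writable)
			return false;

		/*
		 * A read-only mapping still reads correct data through a
		 * stale TLB entry as long as the source folio is kept alive
		 * alongside the destination.  If the mapping later becomes
		 * writable, e.g. in do_wp_page(), the pending flush has to be
		 * performed first, which is what migrc_try_flush() does in
		 * the fault-path hunk below.
		 */
		return true;
	}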
> 30.396700625 seconds time elapsed > 80.331252000 seconds user > 10.303761000 seconds sys > > $ cat /proc/vmstat > > ... > pgdemote_kswapd 1044602 > numa_pages_migrated 1044202 > pgmigrate_success 2157808 > nr_tlb_remote_flush 30453 > nr_tlb_remote_flush_received 88840 > nr_tlb_local_flush_all 3039 > nr_tlb_local_flush_one 198875 > ... > > Signed-off-by: Byungchul Park <byungchul@xxxxxx> > --- > arch/x86/include/asm/tlbflush.h | 7 + > arch/x86/mm/tlb.c | 52 ++++++ > include/linux/mm.h | 30 ++++ > include/linux/mm_types.h | 34 ++++ > include/linux/mmzone.h | 6 + > include/linux/sched.h | 4 + > init/Kconfig | 12 ++ > mm/internal.h | 10 ++ > mm/memory.c | 9 +- > mm/migrate.c | 287 +++++++++++++++++++++++++++++++- > mm/mm_init.c | 1 + > mm/page_alloc.c | 16 ++ > mm/rmap.c | 92 ++++++++++ > 13 files changed, 555 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h > index 63504cde364b..da987c15049e 100644 > --- a/arch/x86/include/asm/tlbflush.h > +++ b/arch/x86/include/asm/tlbflush.h > @@ -279,9 +279,16 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, > } > > extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch); > +extern void arch_tlbbatch_clean(struct arch_tlbflush_unmap_batch *batch); > extern void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst, > struct arch_tlbflush_unmap_batch *bsrc); > > +#ifdef CONFIG_MIGRC > +extern void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen); > +#else > +static inline void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen) {} > +#endif > + > static inline bool pte_flags_need_flush(unsigned long oldflags, > unsigned long newflags, > bool ignore_access) > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c > index 69d145f1fff1..54f98a50fd59 100644 > --- a/arch/x86/mm/tlb.c > +++ b/arch/x86/mm/tlb.c > @@ -1210,9 +1210,40 @@ STATIC_NOPV void native_flush_tlb_local(void) > native_write_cr3(__native_read_cr3()); > } > > +#ifdef CONFIG_MIGRC > +DEFINE_PER_CPU(int, migrc_done); > + > +static inline int migrc_tlb_local_begin(void) > +{ > + int ret = atomic_read(&migrc_gen); > + > + smp_mb__after_atomic(); > + return ret; > +} > + > +static inline void migrc_tlb_local_end(int gen) > +{ > + smp_mb(); > + WRITE_ONCE(*this_cpu_ptr(&migrc_done), gen); > +} > +#else > +static inline int migrc_tlb_local_begin(void) > +{ > + return 0; > +} > + > +static inline void migrc_tlb_local_end(int gen) > +{ > +} > +#endif > + > void flush_tlb_local(void) > { > + unsigned int gen; > + > + gen = migrc_tlb_local_begin(); > __flush_tlb_local(); > + migrc_tlb_local_end(gen); > } > > /* > @@ -1237,6 +1268,22 @@ void __flush_tlb_all(void) > } > EXPORT_SYMBOL_GPL(__flush_tlb_all); > > +#ifdef CONFIG_MIGRC > +static inline bool before(int a, int b) > +{ > + return a - b < 0; > +} > + > +void arch_migrc_adj(struct arch_tlbflush_unmap_batch *batch, int gen) > +{ > + int cpu; > + > + for_each_cpu(cpu, &batch->cpumask) > + if (!before(READ_ONCE(*per_cpu_ptr(&migrc_done, cpu)), gen)) > + cpumask_clear_cpu(cpu, &batch->cpumask); > +} > +#endif > + > void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) > { > struct flush_tlb_info *info; > @@ -1265,6 +1312,11 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) > put_cpu(); > } > > +void arch_tlbbatch_clean(struct arch_tlbflush_unmap_batch *batch) > +{ > + cpumask_clear(&batch->cpumask); > +} > + > void arch_tlbbatch_fold(struct arch_tlbflush_unmap_batch *bdst, > 
struct arch_tlbflush_unmap_batch *bsrc) > { > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 27ce77080c79..e1f6e1fdab18 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -3816,4 +3816,34 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start, > } > #endif > > +#ifdef CONFIG_MIGRC > +void migrc_init_page(struct page *p); > +bool migrc_pending(struct folio *f); > +void migrc_shrink(struct llist_head *h); > +void migrc_req_start(void); > +void migrc_req_end(void); > +bool migrc_req_processing(void); > +bool migrc_try_flush(void); > +void migrc_try_flush_dirty(void); > +struct migrc_req *fold_ubc_nowr_migrc_req(void); > +void free_migrc_req(struct migrc_req *req); > +int migrc_pending_nr_in_zone(struct zone *z); > + > +extern atomic_t migrc_gen; > +extern struct llist_head migrc_reqs; > +extern struct llist_head migrc_reqs_dirty; > +#else > +static inline void migrc_init_page(struct page *p) {} > +static inline bool migrc_pending(struct folio *f) { return false; } > +static inline void migrc_shrink(struct llist_head *h) {} > +static inline void migrc_req_start(void) {} > +static inline void migrc_req_end(void) {} > +static inline bool migrc_req_processing(void) { return false; } > +static inline bool migrc_try_flush(void) { return false; } > +static inline void migrc_try_flush_dirty(void) {} > +static inline struct migrc_req *fold_ubc_nowr_migrc_req(void) { return NULL; } > +static inline void free_migrc_req(struct migrc_req *req) {} > +static inline int migrc_pending_nr_in_zone(struct zone *z) { return 0; } > +#endif > + > #endif /* _LINUX_MM_H */ > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 306a3d1a0fa6..3be66d3eabd2 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -228,6 +228,10 @@ struct page { > #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS > int _last_cpupid; > #endif > +#ifdef CONFIG_MIGRC > + struct llist_node migrc_node; > + unsigned int migrc_state; > +#endif We cannot enlarge "struct page". 
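(Purely as an illustrative sketch of one way to avoid that, not a concrete
proposal: the pending state could live in a side structure keyed by PFN
instead of new struct page fields.  The xarray and helper names below are
made up, and the llist linkage the patch also adds to struct page would
need a similar treatment.)

	/*
	 * Hypothetical alternative, only to illustrate keeping struct page
	 * unchanged: track the migrc state in an xarray indexed by PFN.
	 */
	static DEFINE_XARRAY(migrc_state_xa);

	static int migrc_set_state(struct page *page, unsigned int state)
	{
		return xa_err(xa_store(&migrc_state_xa, page_to_pfn(page),
				       xa_mk_value(state), GFP_ATOMIC));
	}

	static unsigned int migrc_get_state(struct page *page)
	{
		void *entry = xa_load(&migrc_state_xa, page_to_pfn(page));

		return entry ? xa_to_value(entry) : MIGRC_STATE_NONE;
	}

	static void migrc_clear_state(struct page *page)
	{
		xa_erase(&migrc_state_xa, page_to_pfn(page));
	}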
> } _struct_page_alignment; > > /* > @@ -1255,4 +1259,34 @@ enum { > /* See also internal only FOLL flags in mm/internal.h */ > }; > > +#ifdef CONFIG_MIGRC > +struct migrc_req { > + /* > + * pages pending for TLB flush > + */ > + struct llist_head pages; > + > + /* > + * llist_node of the last page in pages llist > + */ > + struct llist_node *last; > + > + /* > + * for hanging onto migrc_reqs llist > + */ > + struct llist_node llnode; > + > + /* > + * architecture specific batch information > + */ > + struct arch_tlbflush_unmap_batch arch; > + > + /* > + * when the request hung onto migrc_reqs llist > + */ > + int gen; > +}; > +#else > +struct migrc_req {}; > +#endif > #endif /* _LINUX_MM_TYPES_H */ > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index a4889c9d4055..1ec79bb63ba7 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -958,6 +958,9 @@ struct zone { > /* Zone statistics */ > atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; > atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS]; > +#ifdef CONFIG_MIGRC > + atomic_t migrc_pending_nr; > +#endif > } ____cacheline_internodealigned_in_smp; > > enum pgdat_flags { > @@ -1371,6 +1374,9 @@ typedef struct pglist_data { > #ifdef CONFIG_MEMORY_FAILURE > struct memory_failure_stats mf_stats; > #endif > +#ifdef CONFIG_MIGRC > + atomic_t migrc_pending_nr; > +#endif > } pg_data_t; > > #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 2232b2cdfce8..d0a46089959d 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -1323,6 +1323,10 @@ struct task_struct { > > struct tlbflush_unmap_batch tlb_ubc; > struct tlbflush_unmap_batch tlb_ubc_nowr; > +#ifdef CONFIG_MIGRC > + struct migrc_req *mreq; > + struct migrc_req *mreq_dirty; > +#endif > > /* Cache last used pipe for splice(): */ > struct pipe_inode_info *splice_pipe; > diff --git a/init/Kconfig b/init/Kconfig > index 32c24950c4ce..f4882c1be364 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -907,6 +907,18 @@ config NUMA_BALANCING_DEFAULT_ENABLED > If set, automatic NUMA balancing will be enabled if running on a NUMA > machine. > > +config MIGRC > + bool "Deferring TLB flush by keeping read copies on migration" > + depends on ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH > + depends on NUMA_BALANCING > + default n > + help > + TLB flush is necessary when PTE changes by migration. However, > + TLB flush can be deferred if both copies of the src page and > + the dst page are kept until TLB flush if they are non-writable. > + System performance will be improved especially in case that > + promotion and demotion type of migration is heavily happening. 
> + > menuconfig CGROUPS > bool "Control Group support" > select KERNFS > diff --git a/mm/internal.h b/mm/internal.h > index b90d516ad41f..a8e3168614d6 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -841,6 +841,8 @@ void try_to_unmap_flush(void); > void try_to_unmap_flush_dirty(void); > void flush_tlb_batched_pending(struct mm_struct *mm); > void fold_ubc_nowr(void); > +int nr_flush_required(void); > +int nr_flush_required_nowr(void); > #else > static inline void try_to_unmap_flush(void) > { > @@ -854,6 +856,14 @@ static inline void flush_tlb_batched_pending(struct mm_struct *mm) > static inline void fold_ubc_nowr(void) > { > } > +static inline int nr_flush_required(void) > +{ > + return 0; > +} > +static inline int nr_flush_required_nowr(void) > +{ > + return 0; > +} > #endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */ > > extern const struct trace_print_flags pageflag_names[]; > diff --git a/mm/memory.c b/mm/memory.c > index f69fbc251198..061f23e34d69 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -3345,6 +3345,12 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) > > vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte); > > + if (vmf->page) > + folio = page_folio(vmf->page); > + > + if (folio && migrc_pending(folio)) > + migrc_try_flush(); > + > /* > * Shared mapping: we are guaranteed to have VM_WRITE and > * FAULT_FLAG_WRITE set at this point. > @@ -3362,9 +3368,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) > return wp_page_shared(vmf); > } > > - if (vmf->page) > - folio = page_folio(vmf->page); > - > /* > * Private mapping: create an exclusive anonymous page copy if reuse > * is impossible. We might miss VM_WRITE for FOLL_FORCE handling. > diff --git a/mm/migrate.c b/mm/migrate.c > index 01cac26a3127..944c7e179288 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -58,6 +58,244 @@ > > #include "internal.h" > > +#ifdef CONFIG_MIGRC > +static int sysctl_migrc_enable = 1; > +#ifdef CONFIG_SYSCTL > +static int sysctl_migrc_enable_handler(struct ctl_table *table, int write, > + void *buffer, size_t *lenp, loff_t *ppos) > +{ > + struct ctl_table t; > + int err; > + int enabled = sysctl_migrc_enable; > + > + if (write && !capable(CAP_SYS_ADMIN)) > + return -EPERM; > + > + t = *table; > + t.data = &enabled; > + err = proc_dointvec_minmax(&t, write, buffer, lenp, ppos); > + if (err < 0) > + return err; > + if (write) > + sysctl_migrc_enable = enabled; > + return err; > +} > + > +static struct ctl_table migrc_sysctls[] = { > + { > + .procname = "migrc_enable", > + .data = NULL, /* filled in by handler */ > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = sysctl_migrc_enable_handler, > + .extra1 = SYSCTL_ZERO, > + .extra2 = SYSCTL_ONE, > + }, > + {} > +}; > + > +static int __init migrc_sysctl_init(void) > +{ > + register_sysctl_init("vm", migrc_sysctls); > + return 0; > +} > +late_initcall(migrc_sysctl_init); > +#endif > + > +/* > + * TODO: Yeah, it's a non-sense magic number. This simple value manages > + * to work conservatively anyway. However, the value needs to be > + * tuned and adjusted based on the internal condition of memory > + * management subsystem later. > + * > + * Let's start with a simple value for now. 
> + */ > +static const int migrc_pending_max = 512; /* unit: page */ > + > +atomic_t migrc_gen; > +LLIST_HEAD(migrc_reqs); > +LLIST_HEAD(migrc_reqs_dirty); > + > +enum { > + MIGRC_STATE_NONE, > + MIGRC_SRC_PENDING, > + MIGRC_DST_PENDING, > +}; > + > +#define MAX_MIGRC_REQ_NR 4096 > +static struct migrc_req migrc_req_pool_static[MAX_MIGRC_REQ_NR]; > +static atomic_t migrc_req_pool_idx = ATOMIC_INIT(-1); > +static LLIST_HEAD(migrc_req_pool_llist); > +static DEFINE_SPINLOCK(migrc_req_pool_lock); > + > +static struct migrc_req *alloc_migrc_req(void) > +{ > + int idx = atomic_read(&migrc_req_pool_idx); > + struct llist_node *n; > + > + if (idx < MAX_MIGRC_REQ_NR - 1) { > + idx = atomic_inc_return(&migrc_req_pool_idx); > + if (idx < MAX_MIGRC_REQ_NR) > + return migrc_req_pool_static + idx; > + } > + > + spin_lock(&migrc_req_pool_lock); > + n = llist_del_first(&migrc_req_pool_llist); > + spin_unlock(&migrc_req_pool_lock); > + > + return n ? llist_entry(n, struct migrc_req, llnode) : NULL; > +} > + > +void free_migrc_req(struct migrc_req *req) > +{ > + llist_add(&req->llnode, &migrc_req_pool_llist); > +} > + > +static bool migrc_full(int nid) > +{ > + struct pglist_data *node = NODE_DATA(nid); > + > + if (migrc_pending_max == -1) > + return false; > + > + return atomic_read(&node->migrc_pending_nr) >= migrc_pending_max; > +} > + > +void migrc_init_page(struct page *p) > +{ > + WRITE_ONCE(p->migrc_state, MIGRC_STATE_NONE); > +} > + > +/* > + * The list should be isolated before. > + */ > +void migrc_shrink(struct llist_head *h) > +{ > + struct page *p; > + struct llist_node *n; > + > + n = llist_del_all(h); > + llist_for_each_entry(p, n, migrc_node) { > + if (p->migrc_state == MIGRC_SRC_PENDING) { > + struct pglist_data *node; > + struct zone *zone; > + > + node = NODE_DATA(page_to_nid(p)); > + zone = page_zone(p); > + atomic_dec(&node->migrc_pending_nr); > + atomic_dec(&zone->migrc_pending_nr); > + } > + WRITE_ONCE(p->migrc_state, MIGRC_STATE_NONE); > + folio_put(page_folio(p)); > + } > +} > + > +bool migrc_pending(struct folio *f) > +{ > + return READ_ONCE(f->page.migrc_state) != MIGRC_STATE_NONE; > +} > + > +static void migrc_expand_req(struct folio *fsrc, struct folio *fdst) > +{ > + struct migrc_req *req; > + struct pglist_data *node; > + struct zone *zone; > + > + req = fold_ubc_nowr_migrc_req(); > + if (!req) > + return; > + > + folio_get(fsrc); > + folio_get(fdst); > + WRITE_ONCE(fsrc->page.migrc_state, MIGRC_SRC_PENDING); > + WRITE_ONCE(fdst->page.migrc_state, MIGRC_DST_PENDING); > + > + if (llist_add(&fsrc->page.migrc_node, &req->pages)) > + req->last = &fsrc->page.migrc_node; > + llist_add(&fdst->page.migrc_node, &req->pages); > + > + node = NODE_DATA(folio_nid(fsrc)); > + zone = page_zone(&fsrc->page); > + atomic_inc(&node->migrc_pending_nr); > + atomic_inc(&zone->migrc_pending_nr); > + > + if (migrc_full(folio_nid(fsrc))) > + migrc_try_flush(); > +} > + > +void migrc_req_start(void) > +{ > + struct migrc_req *req; > + struct migrc_req *req_dirty; > + > + if (WARN_ON(current->mreq || current->mreq_dirty)) > + return; > + > + req = alloc_migrc_req(); > + req_dirty = alloc_migrc_req(); > + > + if (!req || !req_dirty) > + goto fail; > + > + arch_tlbbatch_clean(&req->arch); > + init_llist_head(&req->pages); > + req->last = NULL; > + current->mreq = req; > + > + arch_tlbbatch_clean(&req_dirty->arch); > + init_llist_head(&req_dirty->pages); > + req_dirty->last = NULL; > + current->mreq_dirty = req_dirty; > + return; > +fail: > + if (req_dirty) > + free_migrc_req(req_dirty); > + if (req) > + 
free_migrc_req(req); > +} > + > +void migrc_req_end(void) > +{ > + struct migrc_req *req = current->mreq; > + struct migrc_req *req_dirty = current->mreq_dirty; > + > + WARN_ON((!req && req_dirty) || (req && !req_dirty)); > + > + if (!req || !req_dirty) > + return; > + > + if (llist_empty(&req->pages)) { > + free_migrc_req(req); > + } else { > + req->gen = atomic_inc_return(&migrc_gen); > + llist_add(&req->llnode, &migrc_reqs); > + } > + current->mreq = NULL; > + > + if (llist_empty(&req_dirty->pages)) { > + free_migrc_req(req_dirty); > + } else { > + req_dirty->gen = atomic_inc_return(&migrc_gen); > + llist_add(&req_dirty->llnode, &migrc_reqs_dirty); > + } > + current->mreq_dirty = NULL; > +} > + > +bool migrc_req_processing(void) > +{ > + return current->mreq && current->mreq_dirty; > +} > + > +int migrc_pending_nr_in_zone(struct zone *z) > +{ > + return atomic_read(&z->migrc_pending_nr); > +} > +#else > +static const int sysctl_migrc_enable; > +static bool migrc_full(int nid) { return true; } > +static void migrc_expand_req(struct folio *fsrc, struct folio *fdst) {} > +#endif > + > bool isolate_movable_page(struct page *page, isolate_mode_t mode) > { > struct folio *folio = folio_get_nontail_page(page); > @@ -383,6 +621,9 @@ static int folio_expected_refs(struct address_space *mapping, > struct folio *folio) > { > int refs = 1; > + > + refs += migrc_pending(folio) ? 1 : 0; > + > if (!mapping) > return refs; > > @@ -1060,6 +1301,12 @@ static void migrate_folio_undo_src(struct folio *src, > bool locked, > struct list_head *ret) > { > + /* > + * TODO: There might be folios already pending for migrc. > + * However, there's no way to cancel those on failure for now. > + * Let's reflect the requirement when needed. > + */ > + > if (page_was_mapped) > remove_migration_ptes(src, src, false); > /* Drop an anon_vma reference if we took one */ > @@ -1627,10 +1874,17 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page, > LIST_HEAD(unmap_folios); > LIST_HEAD(dst_folios); > bool nosplit = (reason == MR_NUMA_MISPLACED); > + bool migrc_cond1; > > VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC && > !list_empty(from) && !list_is_singular(from)); > > + migrc_cond1 = sysctl_migrc_enable && > + ((reason == MR_DEMOTION && current_is_kswapd()) || > + reason == MR_NUMA_MISPLACED); > + > + if (migrc_cond1) > + migrc_req_start(); > for (pass = 0; pass < nr_pass && (retry || large_retry); pass++) { > retry = 0; > large_retry = 0; > @@ -1638,6 +1892,10 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page, > nr_retry_pages = 0; > > list_for_each_entry_safe(folio, folio2, from, lru) { > + int nr_required; > + bool migrc_cond2; > + bool migrc; > + > /* > * Large folio statistics is based on the source large > * folio. 
Capture required information that might get > @@ -1671,8 +1929,14 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page, > continue; > } > > + nr_required = nr_flush_required(); > rc = migrate_folio_unmap(get_new_page, put_new_page, private, > folio, &dst, mode, reason, ret_folios); > + migrc_cond2 = nr_required == nr_flush_required() && > + nr_flush_required_nowr() && > + !migrc_full(folio_nid(folio)); > + migrc = migrc_cond1 && migrc_cond2; > + > /* > * The rules are: > * Success: folio will be freed > @@ -1722,9 +1986,11 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page, > nr_large_failed += large_retry; > stats->nr_thp_failed += thp_retry; > rc_saved = rc; > - if (list_empty(&unmap_folios)) > + if (list_empty(&unmap_folios)) { > + if (migrc_cond1) > + migrc_req_end(); > goto out; > - else > + } else > goto move; > case -EAGAIN: > if (is_large) { > @@ -1742,6 +2008,13 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page, > case MIGRATEPAGE_UNMAP: > list_move_tail(&folio->lru, &unmap_folios); > list_add_tail(&dst->lru, &dst_folios); > + > + if (migrc) > + /* > + * XXX: On migration failure, > + * extra TLB flush might happen. > + */ > + migrc_expand_req(folio, dst); > break; > default: > /* > @@ -1760,6 +2033,7 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page, > stats->nr_failed_pages += nr_pages; > break; > } > + fold_ubc_nowr(); > } > } > nr_failed += retry; > @@ -1767,6 +2041,15 @@ static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page, > stats->nr_thp_failed += thp_retry; > stats->nr_failed_pages += nr_retry_pages; > move: > + /* > + * Should be prior to try_to_unmap_flush() so that > + * migrc_try_flush() that will be performed later based on the > + * gen # assigned in migrc_req_end(), can take benefit of the > + * TLB flushes in try_to_unmap_flush(). > + */ > + if (migrc_cond1) > + migrc_req_end(); > + > /* Flush TLBs for all unmapped folios */ > try_to_unmap_flush(); > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index 7f7f9c677854..87cbddc7d780 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -558,6 +558,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn, > page_mapcount_reset(page); > page_cpupid_reset_last(page); > page_kasan_tag_reset(page); > + migrc_init_page(page); > > INIT_LIST_HEAD(&page->lru); > #ifdef WANT_PAGE_VIRTUAL > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 47421bedc12b..167dadb0d817 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3176,6 +3176,11 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, > long min = mark; > int o; > > + /* > + * There are pages that can be freed by migrc_try_flush(). > + */ > + free_pages += migrc_pending_nr_in_zone(z); > + > /* free_pages may go negative - that's OK */ > free_pages -= __zone_watermark_unusable_free(z, order, alloc_flags); > > @@ -4254,6 +4259,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > unsigned int zonelist_iter_cookie; > int reserve_flags; > > + migrc_try_flush(); > restart: > compaction_retries = 0; > no_progress_loops = 0; > @@ -4769,6 +4775,16 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, > if (likely(page)) > goto out; > > + if (order && migrc_try_flush()) { > + /* > + * Try again after freeing migrc's pending pages in case > + * of high order allocation. 
> + */ > + page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac); > + if (likely(page)) > + goto out; > + } > + > alloc_gfp = gfp; > ac.spread_dirty_pages = false; > > diff --git a/mm/rmap.c b/mm/rmap.c > index d18460a48485..5b251eb01cd4 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -606,6 +606,86 @@ struct anon_vma *folio_lock_anon_vma_read(struct folio *folio, > > #ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH > > +#ifdef CONFIG_MIGRC > +static bool __migrc_try_flush(struct llist_head *h) > +{ > + struct arch_tlbflush_unmap_batch arch; > + struct llist_node *reqs; > + struct migrc_req *req; > + struct migrc_req *req2; > + LLIST_HEAD(pages); > + > + reqs = llist_del_all(h); > + if (!reqs) > + return false; > + > + arch_tlbbatch_clean(&arch); > + > + /* > + * TODO: Optimize the time complexity. > + */ > + llist_for_each_entry_safe(req, req2, reqs, llnode) { > + struct llist_node *n; > + > + arch_migrc_adj(&req->arch, req->gen); > + arch_tlbbatch_fold(&arch, &req->arch); > + > + n = llist_del_all(&req->pages); > + llist_add_batch(n, req->last, &pages); > + free_migrc_req(req); > + } > + > + arch_tlbbatch_flush(&arch); > + migrc_shrink(&pages); > + return true; > +} > + > +bool migrc_try_flush(void) > +{ > + bool ret; > + > + if (migrc_req_processing()) { > + migrc_req_end(); > + migrc_req_start(); > + } > + ret = __migrc_try_flush(&migrc_reqs); > + ret = ret || __migrc_try_flush(&migrc_reqs_dirty); > + > + return ret; > +} > + > +void migrc_try_flush_dirty(void) > +{ > + if (migrc_req_processing()) { > + migrc_req_end(); > + migrc_req_start(); > + } > + __migrc_try_flush(&migrc_reqs_dirty); > +} > + > +struct migrc_req *fold_ubc_nowr_migrc_req(void) > +{ > + struct tlbflush_unmap_batch *tlb_ubc_nowr = ¤t->tlb_ubc_nowr; > + struct migrc_req *req; > + bool dirty; > + > + if (!tlb_ubc_nowr->nr_flush_required) > + return NULL; > + > + dirty = tlb_ubc_nowr->writable; > + req = dirty ? current->mreq_dirty : current->mreq; > + if (!req) { > + fold_ubc_nowr(); > + return NULL; > + } > + > + arch_tlbbatch_fold(&req->arch, &tlb_ubc_nowr->arch); > + tlb_ubc_nowr->nr_flush_required = 0; > + tlb_ubc_nowr->writable = false; > + return req; > +} > +#endif > + > void fold_ubc_nowr(void) > { > struct tlbflush_unmap_batch *tlb_ubc = ¤t->tlb_ubc; > @@ -621,6 +701,16 @@ void fold_ubc_nowr(void) > tlb_ubc_nowr->writable = false; > } > > +int nr_flush_required(void) > +{ > + return current->tlb_ubc.nr_flush_required; > +} > + > +int nr_flush_required_nowr(void) > +{ > + return current->tlb_ubc_nowr.nr_flush_required; > +} > + > /* > * Flush TLB entries for recently unmapped pages from remote CPUs. It is > * important if a PTE was dirty when it was unmapped that it's flushed > @@ -648,6 +738,8 @@ void try_to_unmap_flush_dirty(void) > > if (tlb_ubc->writable || tlb_ubc_nowr->writable) > try_to_unmap_flush(); > + > + migrc_try_flush_dirty(); > } > > /*
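One more note for readers of the thread, on the generation comparison in
the arch/x86/mm/tlb.c hunk earlier in the patch: before() is written as a
signed subtraction rather than a plain "<" so that the comparison stays
correct when migrc_gen wraps around (the kernel is built with
-fno-strict-overflow, so the subtraction simply wraps).  A small
illustrative example, not part of the patch:

	/* Same wrap-safe comparison as in the tlb.c hunk above. */
	static inline bool before(int a, int b)
	{
		return a - b < 0;
	}

	/*
	 * With a = INT_MAX and b = INT_MIN (b being the "newer" generation
	 * taken just after the counter wrapped), a - b wraps to -1, so
	 * before(a, b) is true, whereas a plain "a < b" would wrongly report
	 * that INT_MAX is not before INT_MIN.
	 */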