On 5 Jan 2025, at 20:18, Hyeonggon Yoo wrote:

> On 2025-01-04 2:24 AM, Zi Yan wrote:
>> Now page copies are batched, multi-threaded page copy can be used to
>> increase page copy throughput. Add copy_page_lists_mt() to copy pages
>> in a multi-threaded manner. Empirical data show that more than 32 base
>> pages are needed before multi-threaded page copy shows a benefit, so
>> use 32 as the threshold.
>>
>> Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
>> ---
>>  include/linux/migrate.h |   3 +
>>  mm/Makefile             |   2 +-
>>  mm/copy_pages.c         | 186 ++++++++++++++++++++++++++++++++++++++++
>>  mm/migrate.c            |  19 ++--
>>  4 files changed, 199 insertions(+), 11 deletions(-)
>>  create mode 100644 mm/copy_pages.c
>>
>
> [...snip...]
>
>> +++ b/mm/copy_pages.c
>> @@ -0,0 +1,186 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Parallel page copy routine.
>> + */
>> +
>> +#include <linux/sysctl.h>
>> +#include <linux/highmem.h>
>> +#include <linux/workqueue.h>
>> +#include <linux/slab.h>
>> +#include <linux/migrate.h>
>> +
>> +
>> +unsigned int limit_mt_num = 4;
>> +
>> +struct copy_item {
>> +	char *to;
>> +	char *from;
>> +	unsigned long chunk_size;
>> +};
>> +
>> +struct copy_page_info {
>> +	struct work_struct copy_page_work;
>> +	unsigned long num_items;
>> +	struct copy_item item_list[];
>> +};
>> +
>> +static void copy_page_routine(char *vto, char *vfrom,
>> +			      unsigned long chunk_size)
>> +{
>> +	memcpy(vto, vfrom, chunk_size);
>> +}
>> +
>> +static void copy_page_work_queue_thread(struct work_struct *work)
>> +{
>> +	struct copy_page_info *my_work = (struct copy_page_info *)work;
>> +	int i;
>> +
>> +	for (i = 0; i < my_work->num_items; ++i)
>> +		copy_page_routine(my_work->item_list[i].to,
>> +				  my_work->item_list[i].from,
>> +				  my_work->item_list[i].chunk_size);
>> +}
>> +
>> +int copy_page_lists_mt(struct list_head *dst_folios,
>> +		       struct list_head *src_folios, int nr_items)
>> +{
>> +	int err = 0;
>> +	unsigned int total_mt_num = limit_mt_num;
>> +	int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
>> +	int i;
>> +	struct copy_page_info *work_items[32] = {0};
>> +	const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
>
> What happens here if to_node is a NUMA node without CPUs? (e.g. CXL
> node).

I did not think about that case. In that case, from_node will be used.
If both from and to are CPUless nodes, maybe the node of the executing
CPU should be used to select the cpumask here. (A sketch of that
fallback chain is at the end of this mail.)

> And even with a NUMA node with CPUs I think offloading copies to CPUs
> of either "from node" or "to node" will end up with a CPU touching two
> pages in two different NUMA nodes anyway: one page in the local node
> and the other in the remote node.
>
> In that sense, I don't understand when push_0_pull_1 (introduced in
> patch 5) should be 0 or 1. Am I missing something?
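
For reference, an untested sketch of the fallback described above: try
to_node's CPUs first, fall back to from_node, and finally to the node of
the executing CPU when both are CPUless. The variable names mirror the
posted patch; the fallback chain itself is only a suggestion, not part
of the series:

	int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
	int from_node = folio_nid(list_first_entry(src_folios, struct folio, lru));
	const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);

	/* to_node is CPUless (e.g. a CXL memory-only node): try from_node */
	if (cpumask_empty(per_node_cpumask))
		per_node_cpumask = cpumask_of_node(from_node);

	/* both nodes are CPUless: use the node of the executing CPU */
	if (cpumask_empty(per_node_cpumask))
		per_node_cpumask = cpumask_of_node(numa_node_id());

cpumask_of_node() already returns an empty mask for a memory-only node,
so cpumask_empty() should be enough to detect the CXL case raised above.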