On 5 Jan 2025, at 20:18, Hyeonggon Yoo wrote:

> On 2025-01-04 2:24 AM, Zi Yan wrote:
>> Now page copies are batched, multi-threaded page copy can be used to
>> increase page copy throughput. Add copy_page_lists_mt() to copy pages
>> in a multi-threaded manner. Empirical data show that more than 32 base
>> pages are needed before multi-threaded page copy shows a benefit, so
>> use 32 as the threshold.
>>
>> Signed-off-by: Zi Yan <ziy@xxxxxxxxxx>
>> ---
>>  include/linux/migrate.h |   3 +
>>  mm/Makefile             |   2 +-
>>  mm/copy_pages.c         | 186 ++++++++++++++++++++++++++++++++++++++++
>>  mm/migrate.c            |  19 ++--
>>  4 files changed, 199 insertions(+), 11 deletions(-)
>>  create mode 100644 mm/copy_pages.c
>>
>
> [...snip...]
>
>> +++ b/mm/copy_pages.c
>> @@ -0,0 +1,186 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Parallel page copy routine.
>> + */
>> +
>> +#include <linux/sysctl.h>
>> +#include <linux/highmem.h>
>> +#include <linux/workqueue.h>
>> +#include <linux/slab.h>
>> +#include <linux/migrate.h>
>> +
>> +
>> +unsigned int limit_mt_num = 4;
>> +
>> +struct copy_item {
>> +	char *to;
>> +	char *from;
>> +	unsigned long chunk_size;
>> +};
>> +
>> +struct copy_page_info {
>> +	struct work_struct copy_page_work;
>> +	unsigned long num_items;
>> +	struct copy_item item_list[];
>> +};
>> +
>> +static void copy_page_routine(char *vto, char *vfrom,
>> +			      unsigned long chunk_size)
>> +{
>> +	memcpy(vto, vfrom, chunk_size);
>> +}
>> +
>> +static void copy_page_work_queue_thread(struct work_struct *work)
>> +{
>> +	struct copy_page_info *my_work = (struct copy_page_info *)work;
>> +	int i;
>> +
>> +	for (i = 0; i < my_work->num_items; ++i)
>> +		copy_page_routine(my_work->item_list[i].to,
>> +				  my_work->item_list[i].from,
>> +				  my_work->item_list[i].chunk_size);
>> +}
>> +
>> +int copy_page_lists_mt(struct list_head *dst_folios,
>> +		       struct list_head *src_folios, int nr_items)
>> +{
>> +	int err = 0;
>> +	unsigned int total_mt_num = limit_mt_num;
>> +	int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
>> +	int i;
>> +	struct copy_page_info *work_items[32] = {0};
>> +	const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);
>
> What happens here if to_node is a NUMA node without CPUs? (e.g. CXL
> node).

I did not think about that case. In that case, from_node will be used.
If both from and to are CPUless nodes, maybe the node of the executing
CPU should be used to select the cpumask here. (A sketch of that
fallback chain is at the end of this mail.)

> And even with a NUMA node with CPUs I think offloading copies to CPUs
> of either "from node" or "to node" will end up with a CPU touching two
> pages in two different NUMA nodes anyway: one page in the local node
> and the other in the remote node.
>
> In that sense, I don't understand when push_0_pull_1 (introduced in
> patch 5) should be 0 or 1. Am I missing something?
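
For reference, an untested sketch of the fallback described above: try
to_node's CPUs first, fall back to from_node, and finally to the node of
the executing CPU when both are CPUless. The variable names mirror the
posted patch; the fallback chain itself is only a suggestion, not part
of the series:

	int to_node = folio_nid(list_first_entry(dst_folios, struct folio, lru));
	int from_node = folio_nid(list_first_entry(src_folios, struct folio, lru));
	const struct cpumask *per_node_cpumask = cpumask_of_node(to_node);

	/* to_node is CPUless (e.g. a CXL memory-only node): try from_node */
	if (cpumask_empty(per_node_cpumask))
		per_node_cpumask = cpumask_of_node(from_node);

	/* both nodes are CPUless: use the node of the executing CPU */
	if (cpumask_empty(per_node_cpumask))
		per_node_cpumask = cpumask_of_node(numa_node_id());

cpumask_of_node() already returns an empty mask for a memory-only node,
so cpumask_empty() should be enough to detect the CXL case raised above.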