Re: [PATCH v3 3/7] padata: dispatch works on different nodes

Gang Li <gang.li@xxxxxxxxx> · Mon, 15 Jan 2024 16:57:40 +0800

On 2024/1/13 02:27, Tim Chen wrote:
On Fri, 2024-01-12 at 15:09 +0800, Gang Li wrote:
On 2024/1/12 01:50, Tim Chen wrote:
On Tue, 2024-01-02 at 21:12 +0800, Gang Li wrote:
When a group of tasks that access different nodes are scheduled on the
same node, they may encounter bandwidth bottlenecks and access latency.

Thus, numa_aware flag is introduced here, allowing tasks to be
distributed across different nodes to fully utilize the advantage of
multi-node systems.

Signed-off-by: Gang Li <gang.li@xxxxxxxxx>
---
   include/linux/padata.h | 3 +++
   kernel/padata.c        | 8 ++++++--
   mm/mm_init.c           | 1 +
   3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 495b16b6b4d72..f79ccd50e7f40 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -137,6 +137,8 @@ struct padata_shell {
    *             appropriate for one worker thread to do at once.
    * @max_threads: Max threads to use for the job, actual number may be less
    *               depending on task size and minimum chunk size.
+ * @numa_aware: Dispatch jobs to different nodes. If a node only has memory but
+ *              no CPU, dispatch its jobs to a random CPU.
    */
   struct padata_mt_job {
   	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
@@ -146,6 +148,7 @@ struct padata_mt_job {
   	unsigned long		align;
   	unsigned long		min_chunk;
   	int			max_threads;
+	bool			numa_aware;
   };
   
   /**
diff --git a/kernel/padata.c b/kernel/padata.c
index 179fb1518070c..1c2b3a337479e 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,7 @@ void __init padata_do_multithreaded(struct padata_mt_job *job)
   	struct padata_work my_work, *pw;
   	struct padata_mt_job_state ps;
   	LIST_HEAD(works);
-	int nworks;
+	int nworks, nid = 0;

If we always start from 0, we may be biased towards the low numbered node,
and not use high numbered nodes at all.  Suggest you do
static nid = 0;


When we use `static`, if there are multiple parallel calls to
`padata_do_multithreaded`, it may result in an uneven distribution of
tasks for each padata_do_multithreaded.

We can make the following modifications to address this issue.

```
diff --git a/kernel/padata.c b/kernel/padata.c
index 1c2b3a337479e..925e48df6dd8d 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -485,7 +485,8 @@ void __init padata_do_multithreaded(struct
padata_mt_job *job)
          struct padata_work my_work, *pw;
          struct padata_mt_job_state ps;
          LIST_HEAD(works);
-       int nworks, nid = 0;
+       int nworks, nid;
+       static volatile int global_nid = 0;

          if (job->size == 0)
                  return;
@@ -516,12 +517,15 @@ void __init padata_do_multithreaded(struct
padata_mt_job *job)
          ps.chunk_size = max(ps.chunk_size, job->min_chunk);
          ps.chunk_size = roundup(ps.chunk_size, job->align);

+       nid = global_nid;
          list_for_each_entry(pw, &works, pw_list)
-               if (job->numa_aware)
-                       queue_work_node((++nid % num_node_state(N_MEMORY)),
-                                       system_unbound_wq, &pw->pw_work);
-               else
+               if (job->numa_aware) {
+                       queue_work_node(nid, system_unbound_wq,
&pw->pw_work);
+                       nid = next_node(nid, node_states[N_CPU]);
+               } else
                          queue_work(system_unbound_wq, &pw->pw_work);
+       if (job->numa_aware)
+               global_nid = nid;

Thinking more about it, there could still be multiple threads working
at the same time with stale global_nid.  We should probably do a compare
exchange of global_nid with new nid only if the global nid was unchanged.
Otherwise we should go to the next node with the changed global nid before
we queue the job.

Tim

How about:
```
nid = global_nid;
list_for_each_entry(pw, &works, pw_list)
	if (job->numa_aware) {
		int old_node = nid;
		queue_work_node(nid, system_unbound_wq, &pw->pw_work);
		nid = next_node(nid, node_states[N_CPU]);
		cmpxchg(&global_nid, old_node, nid);
	} else
		queue_work(system_unbound_wq, &pw->pw_work);

```