Re: [PATCH-cgroup 1/2] cgroup/cpuset: Fix remote root partition creation problem

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 6/6/24 09:01, Xavier wrote:
Hi Longman,

I have a small question about your new patch.
I wonder that in cgroup v2,  will there be any overlap between valid partition roots and top_cpuset? If it is not, the section starting with 'restart:' that searches for overlapping cpusets can be skipped for cgroup v2. Otherwise, if there are any overlap, then the assignment to 'dom' may need perform an cpumask_or operation?

In cgroup v2, the top_cpuset is a non-isolating partition root by itself and its partition root state cannot be changed.

The reason for the introduction of the partition feature in cgroup v2 was to support the creation of a separate sched domain. cgroup v1 supports that indirectly via clever use of the cpuset.sched_load_balance flag. So by definition, the presence of a non-isolating partition root defines a new sched domain.

Cheers,
Longman

Best regards,
Xavier

----- Original Message -----
From: Waiman Long <longman@xxxxxxxxxx>
To: Tejun Heo <tj@xxxxxxxxxx>, Zefan Li <lizefan.x@xxxxxxxxxxxxx>, Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: cgroups@xxxxxxxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx, Xavier <ghostxavier@xxxxxxxx>, Waiman Long <longman@xxxxxxxxxx>
Subject: [PATCH-cgroup 1/2] cgroup/cpuset: Fix remote root partition creation problem
Date: 2024-06-06 01:19

Since commit 181c8e091aae ("cgroup/cpuset: Introduce remote partition"),
a remote partition can be created underneath a non-partition root cpuset
as long as its exclusive_cpus are set to distribute exclusive CPUs down
to its children. The generate_sched_domains() function, however, doesn't
take into account this new behavior and hence will fail to create the
sched domain needed for a remote root (non-isolated) partition.
There are two issues related to remote partition support. First of
all, generate_sched_domains() has a fast path that is activated if
root_load_balance is true and top_cpuset.nr_subparts is non-zero. The
later condition isn't quite correct for remote partitions as nr_subparts
just shows the number of local child partitions underneath it. There
can be no local child partition under top_cpuset even if there are
remote partitions further down the hierarchy. Fix that by checking
for subpartitions_cpus which contains exclusive CPUs allocated to both
local and remote partitions.
Secondly, the valid partition check for subtree skipping in the csa[]
generation loop isn't enough as remote partition does not need to
have a partition root parent. Fix this problem by breaking csa[] array
generation loop of generate_sched_domains() into v1 and v2 specific parts
and checking a cpuset's exclusive_cpus before skipping its subtree in
the v2 case.
Also simplify generate_sched_domains() for cgroup v2 as only
non-isolating partition roots should be included in building the cpuset
array and none of the v1 scheduling attributes other than a different
way to create an isolated partition are supported.
Fixes: 181c8e091aae ("cgroup/cpuset: Introduce remote partition")
Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
kernel/cgroup/cpuset.c | 55 ++++++++++++++++++++++++++++++++----------
1 file changed, 42 insertions(+), 13 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index f9b97f65e204..fb71d710a603 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -169,7 +169,7 @@ struct cpuset {
	/* for custom sched domain */
	int relax_domain_level;

-	/* number of valid sub-partitions */
+	/* number of valid local child partitions */
	int nr_subparts;

	/* partition root state */
@@ -957,13 +957,14 @@ static int generate_sched_domains(cpumask_var_t **domains,
	int nslot;		/* next empty doms[] struct cpumask slot */
	struct cgroup_subsys_state *pos_css;
	bool root_load_balance = is_sched_load_balance(&top_cpuset);
+	bool cgrpv2 = cgroup_subsys_on_dfl(cpuset_cgrp_subsys);

	doms = NULL;
	dattr = NULL;
	csa = NULL;

	/* Special case for the 99% of systems with one, full, sched domain */
-	if (root_load_balance && !top_cpuset.nr_subparts) {
+	if (root_load_balance && cpumask_empty(subpartitions_cpus)) {
single_root_domain:
		ndoms = 1;
		doms = alloc_sched_domains(ndoms);
@@ -992,16 +993,18 @@ static int generate_sched_domains(cpumask_var_t **domains,
	cpuset_for_each_descendant_pre(cp, pos_css, &top_cpuset) {
		if (cp == &top_cpuset)
			continue;
+
+		if (cgrpv2)
+			goto v2;
+
		/*
+		 * v1:
		 * Continue traversing beyond @cp iff @cp has some CPUs and
		 * isn't load balancing.  The former is obvious.  The
		 * latter: All child cpusets contain a subset of the
		 * parent's cpus, so just skip them, and then we call
		 * update_domain_attr_tree() to calc relax_domain_level of
		 * the corresponding sched domain.
-		 *
-		 * If root is load-balancing, we can skip @cp if it
-		 * is a subset of the root's effective_cpus.
		 */
		if (!cpumask_empty(cp->cpus_allowed) &&
		    !(is_sched_load_balance(cp) &&
@@ -1009,16 +1012,28 @@ static int generate_sched_domains(cpumask_var_t **domains,
					 housekeeping_cpumask(HK_TYPE_DOMAIN))))
			continue;

-		if (root_load_balance &&
-		    cpumask_subset(cp->cpus_allowed, top_cpuset.effective_cpus))
-			continue;
-
		if (is_sched_load_balance(cp) &&
		    !cpumask_empty(cp->effective_cpus))
			csa[csn++] = cp;

-		/* skip @cp's subtree if not a partition root */
-		if (!is_partition_valid(cp))
+		/* skip @cp's subtree */
+		pos_css = css_rightmost_descendant(pos_css);
+		continue;
+
+v2:
+		/*
+		 * Only valid partition roots that are not isolated and with
+		 * non-empty effective_cpus will be saved into csn[].
+		 */
+		if ((cp->partition_root_state == PRS_ROOT) &&
+		    !cpumask_empty(cp->effective_cpus))
+			csa[csn++] = cp;
+
+		/*
+		 * Skip @cp's subtree if not a partition root and has no
+		 * exclusive CPUs to be granted to child cpusets.
+		 */
+		if (!is_partition_valid(cp) && cpumask_empty(cp->exclusive_cpus))
			pos_css = css_rightmost_descendant(pos_css);
	}
	rcu_read_unlock();
@@ -1072,6 +1087,20 @@ static int generate_sched_domains(cpumask_var_t **domains,
	dattr = kmalloc_array(ndoms, sizeof(struct sched_domain_attr),
			      GFP_KERNEL);

+	/*
+	 * Cgroup v2 doesn't support domain attributes, just set all of them
+	 * to SD_ATTR_INIT. Also non-isolating partition root CPUs are a
+	 * subset of HK_TYPE_DOMAIN housekeeping CPUs.
+	 */
+	if (cgrpv2) {
+		for (i = 0; i < ndoms; i++) {
+			cpumask_copy(doms[i], csa[i]->effective_cpus);         /*****************************************/
+			if (dattr)
+				dattr[i] = SD_ATTR_INIT;
+		}
+		goto done;
+	}
+
	for (nslot = 0, i = 0; i < csn; i++) {
		struct cpuset *a = csa[i];
		struct cpumask *dp;
@@ -1231,7 +1260,7 @@ static void rebuild_sched_domains_locked(void)
	 * root should be only a subset of the active CPUs.  Since a CPU in any
	 * partition root could be offlined, all must be checked.
	 */
-	if (top_cpuset.nr_subparts) {
+	if (!cpumask_empty(subpartitions_cpus)) {
		rcu_read_lock();
		cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
			if (!is_partition_valid(cs)) {
@@ -4575,7 +4604,7 @@ static void cpuset_handle_hotplug(void)
	 * In the rare case that hotplug removes all the cpus in
	 * subpartitions_cpus, we assumed that cpus are updated.
	 */
-	if (!cpus_updated && top_cpuset.nr_subparts)
+	if (!cpus_updated && !cpumask_empty(subpartitions_cpus))
		cpus_updated = true;

	/* For v1, synchronize cpus_allowed to cpu_active_mask */
--
2.39.3





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]     [Monitors]

  Powered by Linux