On 07/06/2018 04:32 PM, Waiman Long wrote: > On 07/03/2018 11:58 AM, Tejun Heo wrote: >> Hello, Waiman. >> >> On Tue, Jul 03, 2018 at 08:41:31AM +0800, Waiman Long wrote: >>>> So, effective changing when enabling partition on a child feels wrong >>>> to me. It's supposed to contain what's actually allowed to the cgroup >>>> from its parent and that shouldn't change regardless of how those >>>> resources are used. It's still given to the cgroup from its parent. >>> Another way to work around this issue is to expose the reserved_cpus in >>> the parent for holding CPUs that can taken by a chid partition. That >>> will require adding one more cpuset file for those cgroups that are >>> partition roots. >> Yeah, that should work. >> > Thinking about it a bit more, that approach will make creating a > partition a multi-step process: > > 1) Reserve the CPUs in reserved_cpus. > 2) enable sched.partition > 3) Write the CPUs list into cpus. > > There are also more exception cases that need to be handled. The current > approach, on the other hands, is much simpler and easier to understand > and use. > >>> I don't mind restricting that to the first level children for now. That >>> does restrict where we can put the container root if we want a separate >>> partition for a container. Let's hear if others have any objection about >>> that. >> As currently implemented, partioning locks away the cpus which should >> be a system level decision, not container level, so it makes sense to >> me that it is only available to system root. > So my preference is to allow partition only on the first level children > of the root for the time being. I think it should cover most of the use > cases. I will update the patchset to reflect that. > > Cheers, > Longman > Below is the incremental patch that allow partitioning only on the first level children of the root. Please let me know your thourght on that. Thanks, Longman -------------[ Cut here ]------------------------- >From 5a41209da94385efff87d79f6523265c710cbea5 Mon Sep 17 00:00:00 2001 From: Waiman Long <longman@xxxxxxxxxx> Date: Tue, 10 Jul 2018 10:23:16 -0400 Subject: [PATCH v11 10/10] cpuset: Restrict sched.partition to first level children of root only Enabling partition on a v2 cpuset has the side effect of affecting the effective CPUs of its parent which is currently unique to partitioning. As we are not sure about the repercussion of enabling that globally, we are now restricting the enabling of sched.partition to the first level children of the root cgroup in the default hierarchy. This is done by removing the "cpuset.sched.partition" control file on cgroups that are not the first level children of the root. A new show_cfile function pointer is added to the cftype structure. If it is defined, it will be called to return a boolean value to determine if the corresponding control file should show up in the cgroup. It provides a more flexible mechanism to determine the visibility of the control file than a simple CFTYPE_* flag can do. Signed-off-by: Waiman Long <longman@xxxxxxxxxx> --- Documentation/admin-guide/cgroup-v2.rst | 10 +++++----- include/linux/cgroup-defs.h | 9 +++++++++ kernel/cgroup/cgroup.c | 4 ++++ kernel/cgroup/cpuset.c | 10 ++++++++++ 4 files changed, 28 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index f7cde15..cf7cd88 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1585,10 +1585,10 @@ Cpuset Interface Files Its value will be affected by memory nodes hotplug events. cpuset.sched.partition - A read-write single value file which exists on non-root - cpuset-enabled cgroups. It is a binary value flag that accepts - either "0" (off) or "1" (on). This flag is set and owned by the - parent cgroup. + A read-write single value file which exists on the first level + children of the root cgroup. It is a binary value flag that + accepts either "0" (off) or "1" (on). This flag is set and + owned by the parent cgroup. If set, it indicates that the current cgroup is the root of a new partition or scheduling domain that comprises itself and @@ -1603,7 +1603,7 @@ Cpuset Interface Files exclusive, i.e. they are not shared by any of its siblings. 2) The "cpuset.cpus" is also a proper subset of the parent's "cpuset.cpus.effective". - 3) The parent cgroup is a partition root. + 3) The parent cgroup is the root cgroup. 4) There is no child cgroups with cpuset enabled. This is for eliminating corner cases that have to be handled if such a condition is allowed. diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index c0e68f9..be79487 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -565,6 +565,15 @@ struct cftype { ssize_t (*write)(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off); + /* + * show_cfile(), if defined, will return a boolean value to + * determine if the control file should show up in the cgroup. + * It provides more flexibility in deciding where the control + * file should appear than simple criteria like on-root or + * not-on-root. + */ + bool (*show_cfile)(struct cgroup_subsys_state *css); + #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lock_class_key lockdep_key; #endif diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 077370b..0afdab8 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -3612,6 +3612,10 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css, if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgroup_parent(cgrp)) continue; + /* Should the control file show up in the cgroup */ + if (cft->show_cfile && !cft->show_cfile(css)) + continue; + if (is_add) { ret = cgroup_add_file(css, cgrp, cft); if (ret) { diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 62b7e61..2f85c1e 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -2106,6 +2106,15 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) } /* + * The sched.partition control file should only show up in the first + * level children of the root cgroup. + */ +static bool cpuset_show_partition(struct cgroup_subsys_state *css) +{ + return parent_cs(css_cs(css)) == &top_cpuset; +} + +/* * for the common functions, 'private' gives the type of file */ @@ -2250,6 +2259,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft) .name = "sched.partition", .read_u64 = cpuset_read_u64, .write_u64 = cpuset_write_u64, + .show_cfile = cpuset_show_partition, .private = FILE_PARTITION_ROOT, .flags = CFTYPE_NOT_ON_ROOT, }, -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe cgroups" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html