On Fri, Feb 8, 2019 at 2:06 AM Patrick Bellasi <patrick.bellasi@xxxxxxx> wrote:
>
> In order to properly support hierarchical resource control, the cgroup
> delegation model requires that attribute writes from a child group never
> fail but still are (potentially) constrained based on parent's assigned
> resources. This requires properly propagating and aggregating parent
> attributes down to its descendants.
>
> Let's implement this mechanism by adding a new "effective" clamp value
> for each task group. The effective clamp value is defined as the smaller
> value between the clamp value of a group and the effective clamp value
> of its parent. This is the actual clamp value enforced on tasks in a
> task group.

In patch 10 in this series you mentioned "b) do not enforce any
constraints and/or dependencies between the parent and its child nodes".
This patch seems to change that behavior. If so, should it be documented?

> Since it can be interesting for userspace, e.g. system management
> software, to know exactly what the currently propagated/enforced
> configuration is, the effective clamp values are exposed to user-space
> by means of a new pair of read-only attributes
> cpu.util.{min,max}.effective.
>
> Signed-off-by: Patrick Bellasi <patrick.bellasi@xxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
>
> ---
> Changes in v7:
>  Others:
>  - ensure clamp values are not tunable at root cgroup level
> ---
>  Documentation/admin-guide/cgroup-v2.rst |  19 ++++
>  kernel/sched/core.c                     | 118 +++++++++++++++++++++++-
>  2 files changed, 133 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 47710a77f4fa..7aad2435e961 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -990,6 +990,16 @@ All time durations are in microseconds.
>          values similar to the sched_setattr(2).
>          This minimum utilization
>          value is used to clamp the task specific minimum utilization clamp.
>
> +  cpu.util.min.effective
> +        A read-only single value file which exists on non-root cgroups and
> +        reports the minimum utilization clamp value currently enforced on a
> +        task group.
> +
> +        The actual minimum utilization in the range [0, 1024].
> +
> +        This value can be lower than cpu.util.min in case a parent cgroup
> +        allows only smaller minimum utilization values.
> +
>    cpu.util.max
>          A read-write single value file which exists on non-root cgroups.
>          The default is "1024". i.e. no utilization capping
> @@ -1000,6 +1010,15 @@ All time durations are in microseconds.
>          values similar to the sched_setattr(2). This maximum utilization
>          value is used to clamp the task specific maximum utilization clamp.
>
> +  cpu.util.max.effective
> +        A read-only single value file which exists on non-root cgroups and
> +        reports the maximum utilization clamp value currently enforced on a
> +        task group.
> +
> +        The actual maximum utilization in the range [0, 1024].
> +
> +        This value can be lower than cpu.util.max in case a parent cgroup
> +        is enforcing a more restrictive clamping on max utilization.
>
>
> Memory
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 122ab069ade5..1e54517acd58 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -720,6 +720,18 @@ static void set_load_weight(struct task_struct *p, bool update_load)
>  }
>
>  #ifdef CONFIG_UCLAMP_TASK
> +/*
> + * Serializes updates of utilization clamp values
> + *
> + * The (slow-path) user-space triggers utilization clamp value updates which
> + * can require updates on (fast-path) scheduler's data structures used to
> + * support enqueue/dequeue operations.
> + * While the per-CPU rq lock protects fast-path update operations, user-space
> + * requests are serialized using a mutex to reduce the risk of conflicting
> + * updates or API abuses.
> + */
> +static DEFINE_MUTEX(uclamp_mutex);
> +
>  /* Max allowed minimum utilization */
>  unsigned int sysctl_sched_uclamp_util_min = SCHED_CAPACITY_SCALE;
>
> @@ -1127,6 +1139,8 @@ static void __init init_uclamp(void)
>         unsigned int value;
>         int cpu;
>
> +       mutex_init(&uclamp_mutex);
> +
>         for_each_possible_cpu(cpu) {
>                 memset(&cpu_rq(cpu)->uclamp, 0, sizeof(struct uclamp_rq));
>                 cpu_rq(cpu)->uclamp_flags = 0;
> @@ -6758,6 +6772,10 @@ static inline int alloc_uclamp_sched_group(struct task_group *tg,
>                         parent->uclamp[clamp_id].value;
>                 tg->uclamp[clamp_id].bucket_id =
>                         parent->uclamp[clamp_id].bucket_id;
> +               tg->uclamp[clamp_id].effective.value =
> +                       parent->uclamp[clamp_id].effective.value;
> +               tg->uclamp[clamp_id].effective.bucket_id =
> +                       parent->uclamp[clamp_id].effective.bucket_id;
>         }
>  #endif
>
> @@ -7011,6 +7029,53 @@ static void cpu_cgroup_attach(struct cgroup_taskset *tset)
>  }
>
>  #ifdef CONFIG_UCLAMP_TASK_GROUP
> +static void cpu_util_update_hier(struct cgroup_subsys_state *css,

s/cpu_util_update_hier/cpu_util_update_heir ?

> +                                unsigned int clamp_id, unsigned int bucket_id,
> +                                unsigned int value)
> +{
> +       struct cgroup_subsys_state *top_css = css;
> +       struct uclamp_se *uc_se, *uc_parent;
> +
> +       css_for_each_descendant_pre(css, top_css) {
> +               /*
> +                * The first visited task group is top_css, whose clamp value
> +                * is the one passed as a parameter. For descendant task
> +                * groups we consider their current value.
> +                */
> +               uc_se = &css_tg(css)->uclamp[clamp_id];
> +               if (css != top_css) {
> +                       value = uc_se->value;
> +                       bucket_id = uc_se->effective.bucket_id;
> +               }
> +               uc_parent = NULL;
> +               if (css_tg(css)->parent)
> +                       uc_parent = &css_tg(css)->parent->uclamp[clamp_id];
> +
> +               /*
> +                * Skip the whole subtree if the current effective clamp
> +                * already matches the TG's clamp value.
> +                * In this case, all the subtrees already have top_value, or a
> +                * more restrictive value, as effective clamp.
> +                */
> +               if (uc_se->effective.value == value &&
> +                   uc_parent && uc_parent->effective.value >= value) {
> +                       css = css_rightmost_descendant(css);
> +                       continue;
> +               }
> +
> +               /* Propagate the most restrictive effective value */
> +               if (uc_parent && uc_parent->effective.value < value) {
> +                       value = uc_parent->effective.value;
> +                       bucket_id = uc_parent->effective.bucket_id;
> +               }
> +               if (uc_se->effective.value == value)
> +                       continue;
> +
> +               uc_se->effective.value = value;
> +               uc_se->effective.bucket_id = bucket_id;
> +       }
> +}
> +
>  static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
>                                   struct cftype *cftype, u64 min_value)
>  {
> @@ -7020,6 +7085,7 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
>         if (min_value > SCHED_CAPACITY_SCALE)
>                 return -ERANGE;
>
> +       mutex_lock(&uclamp_mutex);
>         rcu_read_lock();
>
>         tg = css_tg(css);
> @@ -7038,8 +7104,13 @@ static int cpu_util_min_write_u64(struct cgroup_subsys_state *css,
>         tg->uclamp[UCLAMP_MIN].value = min_value;
>         tg->uclamp[UCLAMP_MIN].bucket_id = uclamp_bucket_id(min_value);
>
> +       /* Update effective clamps to track the most restrictive value */
> +       cpu_util_update_hier(css, UCLAMP_MIN, tg->uclamp[UCLAMP_MIN].bucket_id,
> +                            min_value);
> +
>  out:
>         rcu_read_unlock();
> +       mutex_unlock(&uclamp_mutex);
>
>         return ret;
>  }
> @@ -7053,6 +7124,7 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
>         if (max_value > SCHED_CAPACITY_SCALE)
>                 return -ERANGE;
>
> +       mutex_lock(&uclamp_mutex);
>         rcu_read_lock();
>
>         tg = css_tg(css);
> @@ -7071,21 +7143,29 @@ static int cpu_util_max_write_u64(struct cgroup_subsys_state *css,
>         tg->uclamp[UCLAMP_MAX].value = max_value;
>         tg->uclamp[UCLAMP_MAX].bucket_id = uclamp_bucket_id(max_value);
>
> +       /* Update effective clamps to track the most restrictive value */
> +       cpu_util_update_hier(css, UCLAMP_MAX, tg->uclamp[UCLAMP_MAX].bucket_id,
> +                            max_value);
> +
>  out:
>         rcu_read_unlock();
> +       mutex_unlock(&uclamp_mutex);
>
>         return ret;
>  }
>
>  static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
> -                                  enum uclamp_id clamp_id)
> +                                  enum uclamp_id clamp_id,
> +                                  bool effective)
>  {
>         struct task_group *tg;
>         u64 util_clamp;
>
>         rcu_read_lock();
>         tg = css_tg(css);
> -       util_clamp = tg->uclamp[clamp_id].value;
> +       util_clamp = effective
> +               ? tg->uclamp[clamp_id].effective.value
> +               : tg->uclamp[clamp_id].value;
>         rcu_read_unlock();
>
>         return util_clamp;
> @@ -7094,13 +7174,25 @@ static inline u64 cpu_uclamp_read(struct cgroup_subsys_state *css,
>  static u64 cpu_util_min_read_u64(struct cgroup_subsys_state *css,
>                                  struct cftype *cft)
>  {
> -       return cpu_uclamp_read(css, UCLAMP_MIN);
> +       return cpu_uclamp_read(css, UCLAMP_MIN, false);
>  }
>
>  static u64 cpu_util_max_read_u64(struct cgroup_subsys_state *css,
>                                  struct cftype *cft)
>  {
> -       return cpu_uclamp_read(css, UCLAMP_MAX);
> +       return cpu_uclamp_read(css, UCLAMP_MAX, false);
> +}
> +
> +static u64 cpu_util_min_effective_read_u64(struct cgroup_subsys_state *css,
> +                                          struct cftype *cft)
> +{
> +       return cpu_uclamp_read(css, UCLAMP_MIN, true);
> +}
> +
> +static u64 cpu_util_max_effective_read_u64(struct cgroup_subsys_state *css,
> +                                          struct cftype *cft)
> +{
> +       return cpu_uclamp_read(css, UCLAMP_MAX, true);
>  }
>  #endif /* CONFIG_UCLAMP_TASK_GROUP */
>
> @@ -7448,11 +7540,19 @@ static struct cftype cpu_legacy_files[] = {
>                 .read_u64 = cpu_util_min_read_u64,
>                 .write_u64 = cpu_util_min_write_u64,
>         },
> +       {
> +               .name = "util.min.effective",
> +               .read_u64 = cpu_util_min_effective_read_u64,
> +       },
>         {
>                 .name = "util.max",
>                 .read_u64 = cpu_util_max_read_u64,
>                 .write_u64 = cpu_util_max_write_u64,
>         },
> +       {
> +               .name = "util.max.effective",
> +               .read_u64 = cpu_util_max_effective_read_u64,
> +       },
>  #endif
>         { }     /* Terminate */
>  };
> @@ -7628,12 +7728,22 @@ static struct cftype cpu_files[] = {
>                 .read_u64 = cpu_util_min_read_u64,
>                 .write_u64 = cpu_util_min_write_u64,
>         },
> +       {
> +               .name = "util.min.effective",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .read_u64 = cpu_util_min_effective_read_u64,
> +       },
>         {
>                 .name = "util.max",
>                 .flags = CFTYPE_NOT_ON_ROOT,
>                 .read_u64 = cpu_util_max_read_u64,
>                 .write_u64 = cpu_util_max_write_u64,
>         },
> +       {
> +               .name = "util.max.effective",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .read_u64 = cpu_util_max_effective_read_u64,
> +       },
>  #endif
>         { }     /* terminate */
>  };
> --
> 2.20.1
>