On 08/11/2017 12:37 PM, Tejun Heo wrote:
> In cgroup1, while cpuacct isn't actually controlling any resources, it
> is a separate controller due to combinaton of two factors -
> 1. enabling cpu controller has significant side effects, and 2. we
> have to pick one of the hierarchies to account CPU usages on. cpuacct
> controller is effectively used to designate a hierarchy to track CPU
> usages on.
>
> cgroup2's unified hierarchy removes the second reason and we can
> account basic CPU usages by default. While we can use cpuacct for
> this purpose, both its interface and implementation leave a lot to be
> desired - it collects and exposes two sources of truth which don't
> agree with each other and some of the exposed statistics don't make
> much sense. Also, it propagates all the way up the hierarchy on each
> accounting event which is unnecessary.
>
> This patch adds basic resource accounting mechanism to cgroup2's
> unified hierarchy and accounts CPU usages using it.
>
> * All accountings are done per-cpu and don't propagate immediately.
> It just bumps the per-cgroup per-cpu counters and links to the
> parent's updated list if not already on it.
>
> * On a read, the per-cpu counters are collected into the global ones
> and then propagated upwards. Only the per-cpu counters which have
> changed since the last read are propagated.
>
> * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
> prefix. Total usage is collected from scheduling events. User/sys
> breakdown is sourced from tick sampling and adjusted to the usage
> using cputime_adjuts().

Typo: cputime_adjust()

> This keeps the accounting side hot path O(1) and per-cpu and the read
> side O(nr_updated_since_last_read).
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Li Zefan <lizefan@xxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
> Documentation/cgroup-v2.txt | 9 ++
> include/linux/cgroup-defs.h | 47 ++++++
> include/linux/cgroup.h | 22 +++
> kernel/cgroup/Makefile | 2 +-
> kernel/cgroup/cgroup-internal.h | 8 +
> kernel/cgroup/cgroup.c | 24 ++-
> kernel/cgroup/stat.c | 333 ++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 442 insertions(+), 3 deletions(-)
> create mode 100644 kernel/cgroup/stat.c
>
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index dc44785..3f82169 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -886,6 +886,15 @@ All cgroup core files are prefixed with "cgroup."
> A dying cgroup can consume system resources not exceeding
> limits, which were active at the moment of cgroup deletion.
>
> + cpu.usage_usec
> + CPU time consumed in the subtree.
> +
> + cpu.user_usec
> + User CPU time consumed in the subtree.
> +
> + cpu.system_usec
> + System CPU time consumed in the subtree.
> +
>
> Controllers
> ===========
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 59e4ad9..17da9c8 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -16,6 +16,7 @@
> #include <linux/refcount.h>
> #include <linux/percpu-refcount.h>
> #include <linux/percpu-rwsem.h>
> +#include <linux/u64_stats_sync.h>
> #include <linux/workqueue.h>
> #include <linux/bpf-cgroup.h>
>
> @@ -249,6 +250,47 @@ struct css_set {
> struct rcu_head rcu_head;
> };
>
> +/*
> + * cgroup basic resource usage statistics.
> + * Accounting is done per-cpu in
> + * cgroup_cpu_stat which is then lazily propagated up the hierarchy on
> + * reads. The propagation is selective - only the cgroup_cpu_stats which
> + * have been updated since the last propagation are propagated.
> + */
> +struct cgroup_cpu_stat {
> + /*
> + * ->sync protects all the current counters. These are the only
> + * fields which get updated in the hot path.
> + */
> + struct u64_stats_sync sync;
> + struct task_cputime cputime;
> +
> + /*
> + * Snapshots at the last reading. These are used to calculate the
> + * deltas to propagate to the global counters.
> + */
> + struct task_cputime last_cputime;
> +
> + /*
> + * Child cgroups with stat updates on this cpu since the last read
> + * are linked on the parent's ->updated_children through
> + * ->updated_next.
> + *
> + * In addition to being more compact, singly-linked list pointing
> + * to the cgroup makes it unnecessary for each per-cpu struct to
> + * point back to the associated cgroup.
> + *
> + * Protected by per-cpu cgroup_cpu_stat_lock.
> + */
> + struct cgroup *updated_children; /* terminated by self cgroup */
> + struct cgroup *updated_next; /* NULL iff not on the list */
> +};
> +
> +struct cgroup_stat {
> + /* per-cpu statistics are collected into the folowing global counters */
> + struct task_cputime cputime;
> + struct prev_cputime prev_cputime;
> +};
> +
> struct cgroup {
> /* self css with NULL ->ss, points back to this cgroup */
> struct cgroup_subsys_state self;
> @@ -348,6 +390,11 @@ struct cgroup {
> */
> struct cgroup *dom_cgrp;
>
> + /* cgroup basic resource statistics */
> + struct cgroup_cpu_stat __percpu *cpu_stat;
> + struct cgroup_stat pending_stat; /* pending from children */
> + struct cgroup_stat stat;
> +
> /*
> * list of pidlists, up to two for each namespace (one for procs, one
> * for tasks); created on demand.
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index f395e02..4c82212 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -689,17 +689,39 @@ static inline void cpuacct_account_field(struct task_struct *tsk, int index,
> u64 val) {}
> #endif
>
> +void cgroup_stat_show_cputime(struct seq_file *seq, const char *prefix);
> +
> +void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec);
> +void __cgroup_account_cputime_field(struct cgroup *cgrp,
> + enum cpu_usage_stat index, u64 delta_exec);
> +
> static inline void cgroup_account_cputime(struct task_struct *task,
> u64 delta_exec)
> {
> + struct cgroup *cgrp;
> +
> cpuacct_charge(task, delta_exec);
> +
> + rcu_read_lock();
> + cgrp = task_dfl_cgroup(task);
> + if (cgroup_parent(cgrp))
> + __cgroup_account_cputime(cgrp, delta_exec);
> + rcu_read_unlock();
> }
>
> static inline void cgroup_account_cputime_field(struct task_struct *task,
> enum cpu_usage_stat index,
> u64 delta_exec)
> {
> + struct cgroup *cgrp;
> +
> cpuacct_account_field(task, index, delta_exec);
> +
> + rcu_read_lock();
> + cgrp = task_dfl_cgroup(task);
> + if (cgroup_parent(cgrp))
> + __cgroup_account_cputime_field(cgrp, index, delta_exec);
> + rcu_read_unlock();
> }
>
> #else /* CONFIG_CGROUPS */
> diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
> index ce693cc..0acee61 100644
> --- a/kernel/cgroup/Makefile
> +++ b/kernel/cgroup/Makefile
> @@ -1,4 +1,4 @@
> -obj-y := cgroup.o namespace.o cgroup-v1.o
> +obj-y := cgroup.o stat.o namespace.o cgroup-v1.o
>
> obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
> obj-$(CONFIG_CGROUP_PIDS) += pids.o
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index c167a40..f17da09 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -197,6 +197,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
> int cgroup_task_count(const struct cgroup *cgrp);
>
> /*
> + * stat.c
> + */
> +void cgroup_stat_flush(struct cgroup *cgrp);
> +int cgroup_stat_init(struct cgroup *cgrp);
> +void cgroup_stat_exit(struct cgroup *cgrp);
> +void cgroup_stat_boot(void);
> +
> +/*
> * namespace.c
> */
> extern const struct proc_ns_operations cgroupns_operations;
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c038ccf..a399a55 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -142,12 +142,14 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
> };
> #undef SUBSYS
>
> +static DEFINE_PER_CPU(struct cgroup_cpu_stat, cgrp_dfl_root_cpu_stat);
> +
> /*
> * The default hierarchy, reserved for the subsystems that are otherwise
> * unattached - it never has more than a single cgroup, and all tasks are
> * part of that cgroup.
> */
> -struct cgroup_root cgrp_dfl_root;
> +struct cgroup_root cgrp_dfl_root = { .cgrp.cpu_stat = &cgrp_dfl_root_cpu_stat };
> EXPORT_SYMBOL_GPL(cgrp_dfl_root);
>
> /*
> @@ -3296,6 +3298,8 @@ static int cgroup_stat_show(struct seq_file *seq, void *v)
> seq_printf(seq, "nr_dying_descendants %d\n",
> cgroup->nr_dying_descendants);
>
> + cgroup_stat_show_cputime(seq, "cpu.");
> +
> return 0;
> }
>
> @@ -4466,6 +4470,8 @@ static void css_free_work_fn(struct work_struct *work)
> */
> cgroup_put(cgroup_parent(cgrp));
> kernfs_put(cgrp->kn);
> + if (cgroup_on_dfl(cgrp))
> + cgroup_stat_exit(cgrp);
> kfree(cgrp);
> } else {
> /*
> @@ -4510,6 +4516,9 @@ static void css_release_work_fn(struct work_struct *work)
> /* cgroup release path */
> trace_cgroup_release(cgrp);
>
> + if (cgroup_on_dfl(cgrp))
> + cgroup_stat_flush(cgrp);
> +
> for (tcgrp = cgroup_parent(cgrp); tcgrp;
> tcgrp = cgroup_parent(tcgrp))
> tcgrp->nr_dying_descendants--;
> @@ -4696,6 +4705,12 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
> if (ret)
> goto out_free_cgrp;
>
> + if (cgroup_on_dfl(parent)) {
> + ret = cgroup_stat_init(cgrp);
> + if (ret)
> + goto out_cancel_ref;
> + }
> +
> /*
> * Temporarily set the pointer to NULL, so idr_find() won't return
> * a half-baked cgroup.
> @@ -4703,7 +4718,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
> cgrp->id = cgroup_idr_alloc(&root->cgroup_idr, NULL, 2, 0, GFP_KERNEL);
> if (cgrp->id < 0) {
> ret = -ENOMEM;
> - goto out_cancel_ref;
> + goto out_stat_exit;
> }
>
> init_cgroup_housekeeping(cgrp);
> @@ -4752,6 +4767,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
>
> return cgrp;
>
> +out_stat_exit:
> + if (cgroup_on_dfl(parent))
> + cgroup_stat_exit(cgrp);
> out_cancel_ref:
> percpu_ref_exit(&cgrp->self.refcnt);
> out_free_cgrp:
> @@ -5146,6 +5164,8 @@ int __init cgroup_init(void)
> BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
> BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
>
> + cgroup_stat_boot();
> +
> /*
> * The latency of the synchronize_sched() is too high for cgroups,
> * avoid it at the cost of forcing all readers into the slow path.
> diff --git a/kernel/cgroup/stat.c b/kernel/cgroup/stat.c
> new file mode 100644
> index 0000000..19a10b2
> --- /dev/null
> +++ b/kernel/cgroup/stat.c
> @@ -0,0 +1,333 @@
> +#include "cgroup-internal.h"
> +
> +#include <linux/sched/cputime.h>
> +
> +static DEFINE_MUTEX(cgroup_stat_mutex);
> +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_cpu_stat_lock);

If the hierarchy is large enough and the stat data hasn't been read
recently, it may take a while to accumulate all the stat data even for
one cpu in cgroup_stat_flush_locked(). So I think it would make more
sense to use a regular spinlock_t instead of raw_spinlock_t.

> +
> +static struct cgroup_cpu_stat *cgroup_cpu_stat(struct cgroup *cgrp, int cpu)
> +{
> + return per_cpu_ptr(cgrp->cpu_stat, cpu);
> +}
> +
> +/**
> + * cgroup_cpu_stat_updated - keep track of updated cpu_stat
> + * @cgrp: target cgroup
> + * @cpu: cpu on which cpu_stat was updated
> + *
> + * @cgrp's cpu_stat on @cpu was updated. Put it on the parent's matching
> + * cpu_stat->updated_children list.
> + */
> +static void cgroup_cpu_stat_updated(struct cgroup *cgrp, int cpu)
> +{
> + raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
> + struct cgroup *parent;
> + unsigned long flags;
> +
> + /*
> + * Speculative already-on-list test. This may race leading to
> + * temporary inaccuracies, which is fine.
> + *
> + * Because @parent's updated_children is terminated with @parent
> + * instead of NULL, we can tell whether @cgrp is on the list by
> + * testing the next pointer for NULL.
> + */
> + if (cgroup_cpu_stat(cgrp, cpu)->updated_next)
> + return;
> +
> + raw_spin_lock_irqsave(cpu_lock, flags);
> +
> + /* put @cgrp and all ancestors on the corresponding updated lists */
> + for (parent = cgroup_parent(cgrp); parent;
> + cgrp = parent, parent = cgroup_parent(cgrp)) {
> + struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
> + struct cgroup_cpu_stat *pcstat = cgroup_cpu_stat(parent, cpu);
> +
> + /*
> + * Both additions and removals are bottom-up. If a cgroup
> + * is already in the tree, all ancestors are.
> + */
> + if (cstat->updated_next)
> + break;
> +
> + cstat->updated_next = pcstat->updated_children;
> + pcstat->updated_children = cgrp;
> + }
> +
> + raw_spin_unlock_irqrestore(cpu_lock, flags);
> +}
> +
> +/**
> + * cgroup_cpu_stat_next_updated - iterate cpu_stat updated tree
> + * @pos: current position
> + * @root: root of the tree to traversal
> + * @cpu: target cpu
> + *
> + * Walks the udpated cpu_stat tree on @cpu from @root. %NULL @pos starts
> + * the traversal and %NULL return indicates the end. During traversal,
> + * each returned cgroup is unlinked from the tree. Must be called with the
> + * matching cgroup_cpu_stat_lock held.
> + *
> + * The only ordering guarantee is that, for a parent and a child pair
> + * covered by a given traversal, if a child is visited, its parent is
> + * guaranteed to be visited afterwards.
> + */
> +static struct cgroup *cgroup_cpu_stat_next_updated(struct cgroup *pos,
> + struct cgroup *root, int cpu)

This function unwinds one cgrp from the updated_children and
updated_next linkage. It is somewhat like the opposite of
cgroup_cpu_stat_updated(). I just feel that its name isn't intuitive
enough to convey what it is doing. Maybe use a name like
cgroup_cpu_stat_unlink_one() to match cgroup_cpu_stat_flush_one().

Cheers,
Longman
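
P.S. Just to make the locking suggestion above concrete, here is a rough,
untested sketch of the kind of conversion I have in mind for the per-cpu
lock. The read-side paths (cgroup_stat_flush_locked() and the lock
initialization, presumably in cgroup_stat_boot()) would need the matching
raw_spin_*() -> spin_*() changes as well:

-static DEFINE_PER_CPU(raw_spinlock_t, cgroup_cpu_stat_lock);
+static DEFINE_PER_CPU(spinlock_t, cgroup_cpu_stat_lock);

 static void cgroup_cpu_stat_updated(struct cgroup *cgrp, int cpu)
 {
-	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
+	spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
 	struct cgroup *parent;
 	unsigned long flags;
 	...
-	raw_spin_lock_irqsave(cpu_lock, flags);
+	spin_lock_irqsave(cpu_lock, flags);
 	...
-	raw_spin_unlock_irqrestore(cpu_lock, flags);
+	spin_unlock_irqrestore(cpu_lock, flags);
 }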