@Ben Segall, @Ingo Molnar, @Peter Zijlstra: This would be really useful;
could you please voice your concerns/frustrations so they can be
addressed? This should help increase the density of our bursty web
applications.

@Konstantin, please add a Reviewed-by byline. As I don't do this
regularly, I need all the attribution I can get.

Reviewed-by: Dave Chiluk <chiluk+linux@xxxxxxxxxx>

Thanks,
Dave

On Tue, Nov 26, 2019 at 4:56 AM Konstantin Khlebnikov
<khlebnikov@xxxxxxxxxxxxxx> wrote:
>
> Currently the CFS bandwidth controller assigns cpu.cfs_quota_us of runtime
> to the global pool every cpu.cfs_period_us. All unused runtime expires.
>
> Since commit de53fd7aedb1 ("sched/fair: Fix low cpu usage with high
> throttling by removing expiration of cpu-local slices") the slice assigned
> to a cpu does not expire. This allows serving tiny bursts (up to 1ms),
> but this runtime pool is cpu-bound and is not transferred between cpus.
>
> A setup for an interactive workload with irregular cpu consumption has to
> set the quota according to relatively short spikes of cpu usage. This
> eliminates any possibility of controlling average cpu usage. Increasing
> period and quota proportionally to get bigger runtime chunks is not an
> option, because if an even bigger spike depletes the global pool, execution
> gets stuck until the end of the period and the next refill.
>
> This patch adds limited accumulation of unused runtime from past periods.
> Accumulated runtime does not expire. It stays in the global pool and can
> be used by any cpu. Average cpu usage stays limited to quota / period,
> but a spiky workload can use more cpu power for a short period of time.
>
> The size of the pool for burst runtime is set in the attribute
> cpu.cfs_burst_us. The default is 0, which reflects current behavior.
>
> Statistics for used burst runtime are shown in cpu.stat as "burst_time".
>
> Example setup:
>
> cpu.cfs_period_us = 100000
> cpu.cfs_quota_us = 200000
> cpu.cfs_burst_us = 300000
>
> Average cpu usage stays limited to 2 cpus (quota / period), but the cgroup
> can accumulate runtime (burst) and for 100ms utilize up to 5 cpus
> (quota / period + burst / 100ms), or 3 cpus for 300ms, and so on.
>
> For example, in this cgroup a sample workload with bursts:
>
> fio --name=test --ioengine=cpuio --time_based=1 --runtime=600 \
>     --cpuload=10 --cpuchunks=100000 --numjobs=5
>
> has 50% average cpu usage without throttling. Without bursts enabled, the
> same command can utilize only 25% cpu, and 25% of the time it will be
> throttled.
>
> The implementation is simple. All logic is in
> __refill_cfs_bandwidth_runtime(). The burst pool is kept in the common
> pool, so runtime distribution is unchanged. The remaining changes are the
> interface for cgroup and cgroup2.
>
> For cgroup2, burst is set as the third number in the attribute cpu.max:
> cpu.max = $QUOTA $PERIOD $BURST
>
> When the setup changes, the cgroup gets a full charge of runtime and
> burst runtime.
>
> Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
> Cc: Dave Chiluk <chiluk+linux@xxxxxxxxxx>
> Cc: Cong Wang <xiyou.wangcong@xxxxxxxxx>
>
> ---
>
> v2: fix spelling, add refilling of burst runtime on setup changes, per
>     Dave Chiluk
>
> v1: https://lore.kernel.org/lkml/157312875706.707.12248531434112979828.stgit@buzz/
>
> Cong Wang proposed a similar feature: https://lore.kernel.org/patchwork/patch/907450/
> But with a changed default behavior and without statistics.
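For anyone who wants to reproduce the fio numbers above, here is a quick
usage sketch of my own (not part of the patch). It assumes the patch is
applied, the v1 cpu controller is mounted at /sys/fs/cgroup/cpu, and
"bursty" is just a placeholder group name:

  mkdir /sys/fs/cgroup/cpu/bursty
  echo 100000 > /sys/fs/cgroup/cpu/bursty/cpu.cfs_period_us  # period 100ms
  echo 200000 > /sys/fs/cgroup/cpu/bursty/cpu.cfs_quota_us   # quota 200ms = 2 cpus average
  echo 300000 > /sys/fs/cgroup/cpu/bursty/cpu.cfs_burst_us   # save up to 300ms of unused quota

  echo $$ > /sys/fs/cgroup/cpu/bursty/tasks                  # move this shell into the group
  fio --name=test --ioengine=cpuio --time_based=1 --runtime=600 \
      --cpuload=10 --cpuchunks=100000 --numjobs=5
  grep burst_time /sys/fs/cgroup/cpu/bursty/cpu.stat         # consumed burst, in nanoseconds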
> ---
>  Documentation/admin-guide/cgroup-v2.rst |   15 +++--
>  Documentation/scheduler/sched-bwc.rst   |    8 ++-
>  kernel/sched/core.c                     |   94 +++++++++++++++++++++++++------
>  kernel/sched/fair.c                     |   34 +++++++++--
>  kernel/sched/sched.h                    |    4 +
>  5 files changed, 124 insertions(+), 31 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 5361ebec3361..8c3cc3d882ba 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -981,11 +981,12 @@ All time durations are in microseconds.
>    - user_usec
>    - system_usec
>
> -  and the following three when the controller is enabled:
> +  and the following four when the controller is enabled:
>
>    - nr_periods
>    - nr_throttled
>    - throttled_usec
> +  - burst_usec
>
>  cpu.weight
>          A read-write single value file which exists on non-root
> @@ -1006,16 +1007,18 @@ All time durations are in microseconds.
>          the closest approximation of the current weight.
>
>  cpu.max
> -        A read-write two value file which exists on non-root cgroups.
> -        The default is "max 100000".
> +        A read-write 1..3 values file which exists on non-root cgroups.
> +        The default is "max 100000 0".
>
>          The maximum bandwidth limit. It's in the following format::
>
> -          $MAX $PERIOD
> +          $MAX $PERIOD $BURST
>
>          which indicates that the group may consume upto $MAX in each
> -        $PERIOD duration. "max" for $MAX indicates no limit. If only
> -        one number is written, $MAX is updated.
> +        $PERIOD duration and accumulate up to $BURST time for bursts.
> +
> +        "max" for $MAX indicates no limit.
> +        If only one number is written, $MAX is updated.
>
>  cpu.pressure
>          A read-only nested-key file which exists on non-root cgroups.
> diff --git a/Documentation/scheduler/sched-bwc.rst b/Documentation/scheduler/sched-bwc.rst
> index 9801d6b284b1..59de2ebe11b3 100644
> --- a/Documentation/scheduler/sched-bwc.rst
> +++ b/Documentation/scheduler/sched-bwc.rst
> @@ -27,12 +27,14 @@ Quota and period are managed within the cpu subsystem via cgroupfs.
>
>  cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
>  cpu.cfs_period_us: the length of a period (in microseconds)
> +cpu.cfs_burst_us: the maximum size of the burst run-time pool (in microseconds)
>  cpu.stat: exports throttling statistics [explained further below]
>
>  The default values are::
>
>          cpu.cfs_period_us=100ms
> -        cpu.cfs_quota=-1
> +        cpu.cfs_quota_us=-1
> +        cpu.cfs_burst_us=0
>
>  A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
>  bandwidth restriction in place, such a group is described as an unconstrained
> @@ -51,6 +53,9 @@ and return the group to an unconstrained state once more.
>  Any updates to a group's bandwidth specification will result in it becoming
>  unthrottled if it is in a constrained state.
>
> +Writing a positive value into cpu.cfs_burst_us allows unused quota to
> +accumulate up to this value and be used later in addition to the assigned quota.
> +
>  System wide settings
>  --------------------
>  For efficiency run-time is transferred between the global pool and CPU local
> @@ -75,6 +80,7 @@ cpu.stat:
>  - nr_throttled: Number of times the group has been throttled/limited.
>  - throttled_time: The total time duration (in nanoseconds) for which entities
>    of the group have been throttled.
> +- burst_time: The total running time consumed from the burst pool.
>
>  This interface is read-only.
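A quick illustration of the cgroup2 side as well -- again my own sketch
rather than part of the patch, assuming cgroup2 is mounted at
/sys/fs/cgroup and "bursty" is a placeholder group with the cpu
controller enabled:

  # $MAX $PERIOD $BURST -- mirrors the v1 example: 200ms quota, 100ms period, 300ms burst
  echo "200000 100000 300000" > /sys/fs/cgroup/bursty/cpu.max
  # writing a single number still updates only $MAX, as documented
  echo "150000" > /sys/fs/cgroup/bursty/cpu.max
  # consumed burst shows up in cpu.stat, in microseconds
  grep burst_usec /sys/fs/cgroup/bursty/cpu.stat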
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 44123b4d14e8..c4c8ef521e7c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7364,7 +7364,8 @@ static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
>
>  static int __cfs_schedulable(struct task_group *tg, u64 period, u64 runtime);
>
> -static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> +static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota,
> +                                u64 burst)
>  {
>          int i, ret = 0, runtime_enabled, runtime_was_enabled;
>          struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> @@ -7409,12 +7410,17 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>          raw_spin_lock_irq(&cfs_b->lock);
>          cfs_b->period = ns_to_ktime(period);
>          cfs_b->quota = quota;
> +        cfs_b->burst = burst;
>
> -        __refill_cfs_bandwidth_runtime(cfs_b);
> -
> -        /* Restart the period timer (if active) to handle new period expiry: */
> -        if (runtime_enabled)
> +        /*
> +         * Restart the period timer (if active) to handle new period expiry,
> +         * and refill runtime and burst runtime to a full charge.
> +         */
> +        if (runtime_enabled) {
> +                cfs_b->burst_runtime = burst;
> +                cfs_b->runtime = quota + burst;
>                  start_cfs_bandwidth(cfs_b);
> +        }
>
>          raw_spin_unlock_irq(&cfs_b->lock);
>
> @@ -7442,9 +7448,10 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
>
>  static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
>  {
> -        u64 quota, period;
> +        u64 quota, period, burst;
>
>          period = ktime_to_ns(tg->cfs_bandwidth.period);
> +        burst = tg->cfs_bandwidth.burst;
>          if (cfs_quota_us < 0)
>                  quota = RUNTIME_INF;
>          else if ((u64)cfs_quota_us <= U64_MAX / NSEC_PER_USEC)
> @@ -7452,7 +7459,7 @@ static int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
>          else
>                  return -EINVAL;
>
> -        return tg_set_cfs_bandwidth(tg, period, quota);
> +        return tg_set_cfs_bandwidth(tg, period, quota, burst);
>  }
>
>  static long tg_get_cfs_quota(struct task_group *tg)
> @@ -7470,15 +7477,16 @@ static long tg_get_cfs_quota(struct task_group *tg)
>
>  static int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
>  {
> -        u64 quota, period;
> +        u64 quota, period, burst;
>
>          if ((u64)cfs_period_us > U64_MAX / NSEC_PER_USEC)
>                  return -EINVAL;
>
>          period = (u64)cfs_period_us * NSEC_PER_USEC;
>          quota = tg->cfs_bandwidth.quota;
> +        burst = tg->cfs_bandwidth.burst;
>
> -        return tg_set_cfs_bandwidth(tg, period, quota);
> +        return tg_set_cfs_bandwidth(tg, period, quota, burst);
>  }
>
>  static long tg_get_cfs_period(struct task_group *tg)
> @@ -7491,6 +7499,28 @@ static long tg_get_cfs_period(struct task_group *tg)
>          return cfs_period_us;
>  }
>
> +static long tg_get_cfs_burst(struct task_group *tg)
> +{
> +        u64 cfs_burst_us = tg->cfs_bandwidth.burst;
> +
> +        do_div(cfs_burst_us, NSEC_PER_USEC);
> +        return cfs_burst_us;
> +}
> +
> +static int tg_set_cfs_burst(struct task_group *tg, long cfs_burst_us)
> +{
> +        u64 quota, period, burst;
> +
> +        if ((u64)cfs_burst_us > U64_MAX / NSEC_PER_USEC)
> +                return -EINVAL;
> +
> +        period = ktime_to_ns(tg->cfs_bandwidth.period);
> +        quota = tg->cfs_bandwidth.quota;
> +        burst = (u64)cfs_burst_us * NSEC_PER_USEC;
> +
> +        return tg_set_cfs_bandwidth(tg, period, quota, burst);
> +}
> +
>  static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
>                                    struct cftype *cft)
>  {
> @@ -7515,6 +7545,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
>          return tg_set_cfs_period(css_tg(css), cfs_period_us);
> } > > +static u64 cpu_cfs_burst_read_u64(struct cgroup_subsys_state *css, > + struct cftype *cft) > +{ > + return tg_get_cfs_burst(css_tg(css)); > +} > + > +static int cpu_cfs_burst_write_u64(struct cgroup_subsys_state *css, > + struct cftype *cftype, u64 cfs_burst_us) > +{ > + return tg_set_cfs_burst(css_tg(css), cfs_burst_us); > +} > + > struct cfs_schedulable_data { > struct task_group *tg; > u64 period, quota; > @@ -7606,6 +7648,7 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v) > seq_printf(sf, "nr_periods %d\n", cfs_b->nr_periods); > seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled); > seq_printf(sf, "throttled_time %llu\n", cfs_b->throttled_time); > + seq_printf(sf, "burst_time %llu\n", cfs_b->burst_time); > > if (schedstat_enabled() && tg != &root_task_group) { > u64 ws = 0; > @@ -7667,6 +7710,11 @@ static struct cftype cpu_legacy_files[] = { > .read_u64 = cpu_cfs_period_read_u64, > .write_u64 = cpu_cfs_period_write_u64, > }, > + { > + .name = "cfs_burst_us", > + .read_u64 = cpu_cfs_burst_read_u64, > + .write_u64 = cpu_cfs_burst_write_u64, > + }, > { > .name = "stat", > .seq_show = cpu_cfs_stat_show, > @@ -7709,15 +7757,20 @@ static int cpu_extra_stat_show(struct seq_file *sf, > struct task_group *tg = css_tg(css); > struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth; > u64 throttled_usec; > + u64 burst_usec; > > throttled_usec = cfs_b->throttled_time; > do_div(throttled_usec, NSEC_PER_USEC); > > + burst_usec = cfs_b->burst_time; > + do_div(burst_usec, NSEC_PER_USEC); > + > seq_printf(sf, "nr_periods %d\n" > "nr_throttled %d\n" > - "throttled_usec %llu\n", > + "throttled_usec %llu\n" > + "burst_usec %llu\n", > cfs_b->nr_periods, cfs_b->nr_throttled, > - throttled_usec); > + throttled_usec, burst_usec); > } > #endif > return 0; > @@ -7787,26 +7840,29 @@ static int cpu_weight_nice_write_s64(struct cgroup_subsys_state *css, > #endif > > static void __maybe_unused cpu_period_quota_print(struct seq_file *sf, > - long period, long quota) > + long period, long quota, > + long burst) > { > if (quota < 0) > seq_puts(sf, "max"); > else > seq_printf(sf, "%ld", quota); > > - seq_printf(sf, " %ld\n", period); > + seq_printf(sf, " %ld %ld\n", period, burst); > } > > /* caller should put the current value in *@periodp before calling */ > static int __maybe_unused cpu_period_quota_parse(char *buf, > - u64 *periodp, u64 *quotap) > + u64 *periodp, u64 *quotap, > + s64 *burstp) > { > char tok[21]; /* U64_MAX */ > > - if (sscanf(buf, "%20s %llu", tok, periodp) < 1) > + if (sscanf(buf, "%20s %llu %llu", tok, periodp, burstp) < 1) > return -EINVAL; > > *periodp *= NSEC_PER_USEC; > + *burstp *= NSEC_PER_USEC; > > if (sscanf(tok, "%llu", quotap)) > *quotap *= NSEC_PER_USEC; > @@ -7823,7 +7879,8 @@ static int cpu_max_show(struct seq_file *sf, void *v) > { > struct task_group *tg = css_tg(seq_css(sf)); > > - cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg)); > + cpu_period_quota_print(sf, tg_get_cfs_period(tg), tg_get_cfs_quota(tg), > + tg_get_cfs_burst(tg)); > return 0; > } > > @@ -7832,12 +7889,13 @@ static ssize_t cpu_max_write(struct kernfs_open_file *of, > { > struct task_group *tg = css_tg(of_css(of)); > u64 period = tg_get_cfs_period(tg); > + s64 burst = tg_get_cfs_burst(tg); > u64 quota; > int ret; > > - ret = cpu_period_quota_parse(buf, &period, "a); > + ret = cpu_period_quota_parse(buf, &period, "a, &burst); > if (!ret) > - ret = tg_set_cfs_bandwidth(tg, period, quota); > + ret = tg_set_cfs_bandwidth(tg, period, quota, burst); > return ret ?: nbytes; > } > 
>  #endif
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 69a81a5709ff..f32b72496932 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4353,16 +4353,26 @@ static inline u64 sched_cfs_bandwidth_slice(void)
>  }
>
>  /*
> - * Replenish runtime according to assigned quota. We use sched_clock_cpu
> - * directly instead of rq->clock to avoid adding additional synchronization
> - * around rq->lock.
> + * Replenish runtime according to assigned quota.
> + * Called only if quota != RUNTIME_INF.
>   *
>   * requires cfs_b->lock
>   */
>  void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>  {
> -        if (cfs_b->quota != RUNTIME_INF)
> -                cfs_b->runtime = cfs_b->quota;
> +        u64 runtime = cfs_b->runtime;
> +
> +        /*
> +         * Preserve past runtime up to the burst size. If the remaining runtime
> +         * is below the previous burst runtime, account the delta as used burst time.
> +         */
> +        if (runtime > cfs_b->burst)
> +                runtime = cfs_b->burst;
> +        else if (runtime < cfs_b->burst_runtime)
> +                cfs_b->burst_time += cfs_b->burst_runtime - runtime;
> +
> +        cfs_b->burst_runtime = runtime;
> +        cfs_b->runtime = runtime + cfs_b->quota;
>  }
>
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4968,6 +4978,9 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>          cfs_b->runtime = 0;
>          cfs_b->quota = RUNTIME_INF;
>          cfs_b->period = ns_to_ktime(default_cfs_period());
> +        cfs_b->burst = 0;
> +        cfs_b->burst_runtime = 0;
> +        cfs_b->burst_time = 0;
>
>          INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
>          hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
> @@ -4986,14 +4999,23 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>
>  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>  {
> +        u64 overrun;
> +
>          lockdep_assert_held(&cfs_b->lock);
>
>          if (cfs_b->period_active)
>                  return;
>
>          cfs_b->period_active = 1;
> -        hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> +        overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
>          hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
> +
> +        /*
> +         * Refill runtime for the inactive periods and the current one.
> +         * __refill_cfs_bandwidth_runtime() will trim any excess.
> +         */
> +        cfs_b->runtime += cfs_b->quota * overrun;
> +        __refill_cfs_bandwidth_runtime(cfs_b);
>  }
>
>  static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c8870c5bd7df..6edf6155e4f6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -344,10 +344,14 @@ struct cfs_bandwidth {
>          struct hrtimer          slack_timer;
>          struct list_head        throttled_cfs_rq;
>
> +        u64                     burst;
> +        u64                     burst_runtime;
> +
>          /* Statistics: */
>          int                     nr_periods;
>          int                     nr_throttled;
>          u64                     throttled_time;
> +        u64                     burst_time;
>  #endif
>  };
>
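To double-check the refill math in __refill_cfs_bandwidth_runtime(), here
is how I read it for the example setup (quota=200ms, burst=300ms,
period=100ms). Suppose the group has completely drained its pool
(runtime = 0) and then goes idle; "saved" below is the min(runtime, burst)
cap applied at each refill:

  refill 1: saved = min(0, 300)   = 0   -> runtime = 0   + 200 = 200ms
  refill 2: saved = min(200, 300) = 200 -> runtime = 200 + 200 = 400ms
  refill 3: saved = min(400, 300) = 300 -> runtime = 300 + 200 = 500ms
  refill 4+: steady at 500ms per 100ms period -- the "5 cpus for 100ms" ceiling

If the group then burns 450ms within one period, the next refill finds
runtime = 50ms against a saved burst_runtime of 300ms, so burst_time grows
by 250ms -- exactly the part of the spike that the 200ms quota could not
cover. The accounting looks correct to me.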