On Wed, May 29, 2019 at 02:05:55PM -0700, bsegall@xxxxxxxxxx wrote:
> Dave Chiluk <chiluk+linux@xxxxxxxxxx> writes:
>
> Yeah, having run the test, stranding only 1 ms per cpu rather than 5
> doesn't help if you only have 10 ms of quota and even 10 threads/cpus.
> The slack timer isn't important in this test, though I think it probably
> should be changed.

My min_cfs_rq_runtime was already set to 1ms. Additionally, raising the
quota from 10ms to 50ms or even 100ms still results in throttling
without full quota usage.

> Decreasing min_cfs_rq_runtime helps, but would mean that we have to pull
> quota more often / always. The worst case here I think is where you
> run/sleep for ~1ns, so you wind up taking the lock twice every
> min_cfs_rq_runtime: once for assign and once to return all but min,
> which you then use up doing short run/sleep. I suppose that determines
> how much we care about this overhead at all.

I'm not so concerned about how inefficiently the user-space application
runs, as that's up to the individual developer. The fibtest test case is
purely my approximation of what a Java application with lots of worker
threads might do; I didn't have a good deterministic Java reproducer,
and I feared posting Java to LKML. I'm more concerned with the fact that
the user requested 10ms/period or 100ms/period, and they hit throttling
while simultaneously not seeing that amount of cpu usage.

i.e. on an 8 core machine:

$ ./runfibtest 1
Iterations Completed(M): 1886
Throttled for: 51
CPU Usage (msecs) = 507

$ ./runfibtest 8
Iterations Completed(M): 1274
Throttled for: 52
CPU Usage (msecs) = 380

You see that in the 8 core case, where we have 7 do-nothing threads on
cpus 1-7, we see only 380ms of usage and 52 periods of throttling, when
we should have received ~500ms of cpu usage.
Looking more closely at the __return_cfs_rq_runtime logic, I noticed:

	if (cfs_b->quota != RUNTIME_INF &&
	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {

which is awfully similar to the logic that was fixed by 512ac999. Is it
possible that we are just never returning runtime back to the cfs_b
because of the runtime_expires comparison here?

Early on in my testing I created a patch that reports via sysfs the
amount of quota that was expired in expire_cfs_rq_runtime at the end of
each period, and it roughly equalled the difference between the actual
cpu usage and the allotted quota. That leads me to believe that
something is not going quite correctly in the slack return logic in
__return_cfs_rq_runtime. I'll attach that patch at the bottom of this
e-mail in case you'd like to reproduce my tests.

> Removing expiration means that in the worst case period and quota can be
> effectively twice what the user specified, but only on very particular
> workloads.

I'm only removing expiration of slices that have already been assigned
to individual cfs_rq. My understanding is that there is at most one
cfs_rq per cpu, and each of those can have at most one slice of
available runtime. So the worst-case burst is slice_ms * cpus. Please
help me understand how you get to twice the user-specified quota and
period, as it's not obvious to me (I've only been looking at this for a
few months).

> I think we should at least think about instead lowering
> min_cfs_rq_runtime to some smaller value

Do you mean lower than 1ms?

Thanks,
Dave.
From: Dave Chiluk <chiluk+linux@xxxxxxxxxx>
Date: Thu, 30 May 2019 12:47:12 -0500
Subject: [PATCH] Add expired_time sysfs entry

Signed-off-by: Dave Chiluk <chiluk+linux@xxxxxxxxxx>
---
 kernel/sched/core.c  | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/fair.c  |  4 ++++
 kernel/sched/sched.h |  1 +
 3 files changed, 46 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427..3c06df9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6655,6 +6655,30 @@ static long tg_get_cfs_period(struct task_group *tg)
 	return cfs_period_us;
 }
 
+static int tg_set_cfs_expired_runtime(struct task_group *tg, long slice_expiration)
+{
+	struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
+
+	if (slice_expiration == 0)
+	{
+		raw_spin_lock_irq(&cfs_b->lock);
+		cfs_b->expired_runtime = 0;
+		raw_spin_unlock_irq(&cfs_b->lock);
+		return 0;
+	}
+	return 1;
+}
+
+static u64 tg_get_cfs_expired_runtime(struct task_group *tg)
+{
+	u64 expired_runtime;
+
+	expired_runtime = tg->cfs_bandwidth.expired_runtime;
+	do_div(expired_runtime, NSEC_PER_USEC);
+
+	return expired_runtime;
+}
+
 static s64 cpu_cfs_quota_read_s64(struct cgroup_subsys_state *css,
 				  struct cftype *cft)
 {
@@ -6679,6 +6703,18 @@ static int cpu_cfs_period_write_u64(struct cgroup_subsys_state *css,
 	return tg_set_cfs_period(css_tg(css), cfs_period_us);
 }
 
+static u64 cpu_cfs_expired_runtime_read_u64(struct cgroup_subsys_state *css,
+					    struct cftype *cft)
+{
+	return tg_get_cfs_expired_runtime(css_tg(css));
+}
+
+static int cpu_cfs_expired_runtime_write_s64(struct cgroup_subsys_state *css,
+					     struct cftype *cftype,
+					     s64 cfs_slice_expiration_us)
+{
+	return tg_set_cfs_expired_runtime(css_tg(css), cfs_slice_expiration_us);
+}
+
 struct cfs_schedulable_data {
 	struct task_group *tg;
 	u64 period, quota;
@@ -6832,6 +6868,11 @@ static u64 cpu_rt_period_read_uint(struct cgroup_subsys_state *css,
 		.write_u64 = cpu_cfs_period_write_u64,
 	},
 	{
+		.name = "cfs_expired_runtime",
+		.read_u64 = cpu_cfs_expired_runtime_read_u64,
+		.write_s64 = cpu_cfs_expired_runtime_write_s64,
+	},
+	{
 		.name = "stat",
 		.seq_show = cpu_cfs_stat_show,
 	},
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f..bcbd4ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4382,6 +4382,9 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		cfs_rq->runtime_expires += TICK_NSEC;
 	} else {
 		/* global deadline is ahead, expiration has passed */
+		raw_spin_lock_irq(&cfs_b->lock);
+		cfs_b->expired_runtime += cfs_rq->runtime_remaining;
+		raw_spin_unlock_irq(&cfs_b->lock);
 		cfs_rq->runtime_remaining = 0;
 	}
 }
@@ -4943,6 +4946,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->runtime = 0;
 	cfs_b->quota = RUNTIME_INF;
 	cfs_b->period = ns_to_ktime(default_cfs_period());
+	cfs_b->expired_runtime = 0;
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC,
 		     HRTIMER_MODE_ABS_PINNED);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b52ed1a..499d2e2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -343,6 +343,7 @@ struct cfs_bandwidth {
 	s64			hierarchical_quota;
 	u64			runtime_expires;
 	int			expires_seq;
+	u64			expired_runtime;
 
 	short			idle;
 	short			period_active;
-- 
1.8.3.1