On Fri, Dec 6, 2019 at 5:04 PM Xuewei Zhang <xueweiz@xxxxxxxxxx> wrote: Sorry. I forgot to add the upstream line: "commit 4929a4e6faa0f13289a67cae98139e727f0d4a97 upstream." Sending a new patch. Please ignore this one. Sorry about that. > > The quota/period ratio is used to ensure a child task group won't get > more bandwidth than the parent task group, and is calculated as: > > normalized_cfs_quota() = [(quota_us << 20) / period_us] > > If the quota/period ratio was changed during this scaling due to > precision loss, it will cause inconsistency between parent and child > task groups. > > See below example: > > A userspace container manager (kubelet) does three operations: > > 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us. > 2) Create a few children cgroups. > 3) Set quota to 1,000us and period to 10,000us on a child cgroup. > > These operations are expected to succeed. However, if the scaling of > 147/128 happens before step 3, quota and period of the parent cgroup > will be changed: > > new_quota: 1148437ns, 1148us > new_period: 11484375ns, 11484us > > And when step 3 comes in, the ratio of the child cgroup will be > 104857, which will be larger than the parent cgroup ratio (104821), > and will fail. > > Scaling them by a factor of 2 will fix the problem. > > Tested-by: Phil Auld <pauld@xxxxxxxxxx> > Signed-off-by: Xuewei Zhang <xueweiz@xxxxxxxxxx> > Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx> > Acked-by: Phil Auld <pauld@xxxxxxxxxx> > Cc: Anton Blanchard <anton@xxxxxxxxxx> > Cc: Ben Segall <bsegall@xxxxxxxxxx> > Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx> > Cc: Juri Lelli <juri.lelli@xxxxxxxxxx> > Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> > Cc: Mel Gorman <mgorman@xxxxxxx> > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> > Cc: Steven Rostedt <rostedt@xxxxxxxxxxx> > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> > Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx> > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup") > Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@xxxxxxxxxx > Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx> > --- > kernel/sched/fair.c | 36 ++++++++++++++++++++++-------------- > 1 file changed, 22 insertions(+), 14 deletions(-) > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index f77fcd37b226..f0abb8fe0ae9 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4868,20 +4868,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer) > if (++count > 3) { > u64 new, old = ktime_to_ns(cfs_b->period); > > - new = (old * 147) / 128; /* ~115% */ > - new = min(new, max_cfs_quota_period); > - > - cfs_b->period = ns_to_ktime(new); > - > - /* since max is 1s, this is limited to 1e9^2, which fits in u64 */ > - cfs_b->quota *= new; > - cfs_b->quota = div64_u64(cfs_b->quota, old); > - > - pr_warn_ratelimited( > - "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us %lld, cfs_quota_us = %lld)\n", > - smp_processor_id(), > - div_u64(new, NSEC_PER_USEC), > - div_u64(cfs_b->quota, NSEC_PER_USEC)); > + /* > + * Grow period by a factor of 2 to avoid losing precision. > + * Precision loss in the quota/period ratio can cause __cfs_schedulable > + * to fail. > + */ > + new = old * 2; > + if (new < max_cfs_quota_period) { > + cfs_b->period = ns_to_ktime(new); > + cfs_b->quota *= 2; > + > + pr_warn_ratelimited( > + "cfs_period_timer[cpu%d]: period too short, scaling up (new cfs_period_us = %lld, cfs_quota_us = %lld)\n", > + smp_processor_id(), > + div_u64(new, NSEC_PER_USEC), > + div_u64(cfs_b->quota, NSEC_PER_USEC)); > + } else { > + pr_warn_ratelimited( > + "cfs_period_timer[cpu%d]: period too short, but cannot scale up without losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n", > + smp_processor_id(), > + div_u64(old, NSEC_PER_USEC), > + div_u64(cfs_b->quota, NSEC_PER_USEC)); > + } > > /* reset count so we don't come right back in here */ > count = 0; > -- > 2.24.0.393.g34dc348eaf-goog >