On Fri, Jan 14, 2022 at 10:31:55AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 05, 2022 at 07:46:55PM -0500, Daniel Jordan wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 44c452072a1b..3c2d7f245c68 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4655,10 +4655,19 @@ static inline u64 sched_cfs_bandwidth_slice(void)
> >   */
> >  void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> >  {
> > -        if (unlikely(cfs_b->quota == RUNTIME_INF))
> > +        u64 quota = cfs_b->quota;
> > +        u64 payment;
> > +
> > +        if (unlikely(quota == RUNTIME_INF))
> >                  return;
> >
> > -        cfs_b->runtime += cfs_b->quota;
> > +        if (cfs_b->debt) {
> > +                payment = min(quota, cfs_b->debt);
> > +                cfs_b->debt -= payment;
> > +                quota -= payment;
> > +        }
> > +
> > +        cfs_b->runtime += quota;
> >          cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst);
> >  }
>
> It might be easier to make cfs_bandwidth::runtime an s64 and make it go
> negative.

Yep, nice, no need for a new field in cfs_bandwidth.

> > @@ -5406,6 +5415,32 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
> >          rcu_read_unlock();
> >  }
> >
> > +static void incur_cfs_debt(struct rq *rq, struct sched_entity *se,
> > +                           struct task_group *tg, u64 debt)
> > +{
> > +        if (!cfs_bandwidth_used())
> > +                return;
> > +
> > +        while (tg != &root_task_group) {
> > +                struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> > +
> > +                if (cfs_rq->runtime_enabled) {
> > +                        struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
> > +                        u64 payment;
> > +
> > +                        raw_spin_lock(&cfs_b->lock);
> > +
> > +                        payment = min(cfs_b->runtime, debt);
> > +                        cfs_b->runtime -= payment;
>
> At this point it might hit 0 (or go negative if/when you do the above)
> and you'll need to throttle the group.

I might not be following you, but there could be cfs_rq's with local
runtime_remaining, so even if it goes 0 or negative, the group might
still have quota left and so shouldn't be throttled right away.
I was thinking the throttling would happen as normal, when a cfs_rq ran
out of runtime_remaining and failed to refill it from
cfs_bandwidth::runtime.

> > +                        cfs_b->debt += debt - payment;
> > +
> > +                        raw_spin_unlock(&cfs_b->lock);
> > +                }
> > +
> > +                tg = tg->parent;
> > +        }
> > +}
>
> So part of the problem I have with this is that these external things
> can consume all the bandwidth and basically indefinitely starve the
> group.
>
> This is doulby so if you're going to account things like softirq network
> processing.

Yes.  As Tejun points out, I'll make sure remote charging doesn't run
away.

> Also, why does the whole charging API have a task argument? It either is
> current or NULL in case of things like softirq, neither really make
> sense as an argument.

@task distinguishes between NULL for softirq and current for everybody
else.  It's possible to detect this difference internally though, if
that's what you're saying, so @task can go away.

> Also, by virtue of this being a start-stop annotation interface, the
> accrued time might be arbitrarily large and arbitrarily delayed. I'm not
> sure that's sensible.

Yes, that is a risk.  With start-stop, users need to be careful to
account often enough and have a "reasonable" upper bound on period
length, however that's defined.  Multithreaded jobs are probably the
worst offenders, since their threads charge a sizable amount at once
compared to the other use cases.

> For tasks it might be better to mark the task and have the tick DTRT
> instead of later trying to 'migrate' the time.

Ok, I'll try that.  The start-stop approach keeps remote charging from
adding overhead in the tick for non-remote-charging things, far and away
the common case, but I'll see how expensive the tick-based approach is.
I can hide it behind a static branch for systems not using the cpu
controller.