On 18/07/24 17:25, Benjamin Segall wrote:
> Valentin Schneider <vschneid@xxxxxxxxxx> writes:
>
>> As reported in [1], CFS bandwidth throttling is a source of headaches in
>> PREEMPT_RT - generally speaking, a throttled CFS task can hold locks that
>> prevent ksoftirqd from running, which prevents replenishing & unthrottling
>> the cfs_rq of said CFS task.
>>
>> Peter mentioned that there have been discussions on changing /when/ the
>> throttling happens: rather than have it be done immediately upon updating
>> the runtime statistics and realizing the cfs_rq has depleted its quota, we
>> wait for the task to be about to return to userspace.
>
> Sorry for taking a while to get to replying to this.

Well it's a pretty big diff, so thanks for taking a look!

>
>> Clocks
>> ======
>>
>> Correctly handling the different cfs_rq->throttled_clock* is tricky, as
>> unlike the current upstream approach where all tasks of a cfs_rq are
>> throttled at the exact same time, here they each get throttled at a
>> per-task, not-known-beforehand time.
>>
>> For instance, for the ->throttled_clock_pelt, ideally we would need a
>> per-task snapshot of when the task gets really throttled in
>> exit_to_user_mode(), rather than a single snapshot of when the cfs_rq runs
>> out of runtime. This isn't implemented here. The ->throttled_clock_pelt is
>> set when the cfs_rq runs out of runtime, which means the "grace period"
>> given to the cfs_rq's tasks on their way to exit_to_user_mode() isn't
>> accounted.
>
> And I haven't checked it yet because I never remember how the whole
> throttled clock thing works. :P
>
>>
>> Notable behaviour changes
>> =========================
>>
>> Once a cfs_rq is ->throttled, its tasks can continue running until they hit
>> exit_to_user_mode(). This means they can keep draining further runtime
>> from their cfs_rq, which can end up draining more than one period's worth
>> of runtime.
>>
>> I've tested a 10ms runtime / 100ms period cgroup with an always running
>> task: upstream gets a "clean" periodic pattern of 10ms runtime every 100ms,
>> whereas this gets something more like 40ms runtime every 400ms.
>
> Hmm, this seems a little odd since TWA_RESUME does a kick_process.

I didn't ponder too much on the workload used here, but the point I wanted to
bring up is: if you give a cgroup X amount of runtime, it may still consume
more than that within a single period, because execution in kernelspace isn't
immediately stopped/throttled. It means the "standard" bandwidth control
behaviour becomes a bit more bursty.

>> +static void task_throttle_cancel_irq_work_fn(struct irq_work *work)
>> +{
>> +        struct task_struct *p = container_of(work, struct task_struct, unthrottle_irq_work);
>> +        int cpu = raw_smp_processor_id();
>> +
>> +        CLASS(task_rq_lock, rq_guard)(p);
>> +        WARN_ON_ONCE(task_cpu(p) != cpu);
>> +
>> +        if (task_has_throttle_work(p) && !task_needs_throttling(p))
>> +                task_throttle_do_cancel_work(p);
>> +}
>
> I think you can wind up triggering this WARN and in general have some
> weird state, whether or not it matters, basically any time that you
> __task_throttle_cancel(p, a_remote_cpu).
>
> It queues an irq_work and sends an IPI, but doesn't wait for it. If
> handling that IPI is delayed, then we can wind up doing more
> migration/class switch/etc (on this cpu or some third cpu) before that
> irq_work runs.
>
> I think this may be ok and it's just the WARN that's wrong, but I'm not
> sure.

That whole thing is indeed nasty, /me goes back through my notes

Right, so you can have:

  task_cpu(p) == CPU0

  CPU0                                    CPU1

  rq_lock();                              rq_lock();
  switched_from_fair()
    task_throttle_cancel()
  ...
  rq_unlock();
                                          pull_rt_task()
                                            set_task_cpu(p, CPU1);
                                          rq_unlock();

  task_throttle_cancel_irq_work_fn()
    task_throttle_do_cancel()
      WARN_ON_ONCE(CPU1 != CPU0);

That does mean the task p could start executing on CPU1 before the task_work
is cleared, which is something I want to avoid - while very unlikely, the
worst case is it reaches exit_to_user_mode() before the task_work gets
cleared, which is a no-no.

I could add a sched_class check in throttle_cfs_rq_work(), but this is making
me wonder if irq_work is the right mechanism here. The one important thing
for me is: if a task doesn't need throttling anymore, then the
dequeue_task_fair() in exit_to_user_mode() mustn't happen.

One option I'm seeing is to forget about the irq_work, always re-check the
sched_class in throttle_cfs_rq_work() (we already check cfs_rq->throttle_count
aka throttled_hierarchy()), and just do a kick_process() to ensure a kernel
entry.
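Roughly what I have in mind - entirely untested, the task_work field name
(sched_throttle_work) and the helper bodies below are just placeholders for
the sketch, the other names are the ones from this series:

static void throttle_cfs_rq_work(struct callback_head *work)
{
        /* 'sched_throttle_work' is a placeholder name for the task_work field */
        struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);

        CLASS(task_rq_lock, rq_guard)(p);

        /* The task may have switched class or migrated since the work was queued */
        if (p->sched_class != &fair_sched_class)
                return;

        /* The hierarchy may have been unthrottled in the meantime */
        if (!throttled_hierarchy(task_cfs_rq(p)))
                return;

        /* ... actual dequeue + throttled_limbo_list handling goes here ... */
}

/* Cancel side: no irq_work, just make sure the task re-enters the kernel */
static void task_throttle_cancel(struct task_struct *p)
{
        lockdep_assert_rq_held(task_rq(p));

        if (task_has_throttle_work(p) && !task_needs_throttling(p))
                kick_process(p);
}

With something like that, a stale/raced cancel at worst costs an extra IPI and
a no-op pass through the task_work, rather than a bogus dequeue in
exit_to_user_mode().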
>> +        /*
>> +         * Re-enqueue the tasks that have been throttled at this level.
>> +         *
>> +         * The task count is up-propagated via ->unthrottled_*h_nr_running,
>> +         * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
>> +         * might happen when a cfs_rq still has some tasks enqueued, either still
>> +         * making their way to userspace, or freshly migrated to it.
>> +         */
>> +        list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>> +                struct sched_entity *pse = &p->se;
>> +
>> +                list_del_init(&p->throttle_node);
>> +
>> +                enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
>> +                task_delta++;
>> +                idle_task_delta += task_has_idle_policy(p);
>> +        }
>
> You know, on our too-large machines with too-many cgroups, we're already
> running into these walk_tg_trees being worryingly slow for holding a
> spinlock :P
>

Yeah, for N throttled cgroups in a hierarchy, this is pretty much squashing N
rq-lock-held segments into one. Not great.

What I'm thinking is we could have N walks, each walk re-queueing the tasks of
a single cfs_rq and updating the hierarchy, releasing the rq lock between
walks - instead of a single walk re-queueing up to N lists of tasks. More
operations, but smaller rq-lock-held segments.
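Not thought through at all, but the shape would be something like the below -
both helpers called in the loop are hypothetical and how this gets driven from
the bandwidth timer is ignored; the only point is where the rq lock gets
dropped:

static void unthrottle_cfs_rq_in_chunks(struct cfs_rq *cfs_rq)
{
        struct rq *rq = rq_of(cfs_rq);
        struct cfs_rq *child;
        struct rq_flags rf;

        /*
         * One iteration per throttled descendant instead of one big
         * walk_tg_tree_from(): each iteration re-enqueues a single
         * throttled_limbo_list and fixes up the *h_nr_running counts of
         * its ancestors, then lets go of the rq lock.
         */
        while ((child = next_throttled_descendant(cfs_rq))) {    /* hypothetical */
                rq_lock_irqsave(rq, &rf);
                unthrottle_one_cfs_rq(child);                     /* hypothetical */
                rq_unlock_irqrestore(rq, &rf);

                /* Window for other rq lock users between chunks */
        }
}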
>> +
>> +        /*
>> +         * Account tasks woken up in children; by this point all direct children
>> +         * have been visited.
>> +         */
>> +        task_delta += cfs_rq->unthrottled_h_nr_running;
>> +        idle_task_delta += cfs_rq->unthrottled_idle_h_nr_running;
>> +
>> +        cfs_rq->h_nr_running += task_delta;
>> +        cfs_rq->idle_h_nr_running += idle_task_delta;
>> +
>> +        /*
>> +         * unthrottle_cfs_rq() needs a value to up-propagate above the
>> +         * freshly unthrottled cfs_rq.
>> +         */
>> +        cfs_rq->unthrottled_h_nr_running = task_delta;
>> +        cfs_rq->unthrottled_idle_h_nr_running = idle_task_delta;
>
> I think this should have no effect, right?

Hm, so my thoughts here are:

The walk_tg_tree_from(tg_unthrottle_up) will update the *nr_running counts up
to cfs_rq->tg->se[cpu_of(rq)]. However, if that cfs_rq isn't the root one,
we'll need the for_each_sched_entity() loop further down unthrottle_cfs_rq()
to update the upper part of the hierarchy. The values that will be
up-propagated there are the ones being saved here.

>
>> +
>> +        /* Accumulate the delta in the parent's stash. Once all its children
>> +         * (i.e. all of this cfs_rq's siblings) have been visited, this value
>> +         * will be stable and used for its own count update.
>> +         */
>> +        pcfs_rq->unthrottled_h_nr_running += task_delta;
>> +        pcfs_rq->unthrottled_idle_h_nr_running += idle_task_delta;
>> +
>> +        /*
>> +         * If the cfs_rq became empty during throttling, then we dequeued
>> +         * it. It needs to be put back in the hierarchy if it or any of
>> +         * its children have now-unthrottled tasks.
>> +         */
>> +        if (!se->on_rq && (cfs_rq->h_nr_running || cfs_rq->idle_h_nr_running)) {
>> +                enqueue_entity(pcfs_rq, se, ENQUEUE_WAKEUP);
>> +        } else {
>> +                update_load_avg(pcfs_rq, se, UPDATE_TG);
>> +                se_update_runnable(se);
>>          }
>
> So I think this is trying to combine all the updates, and it's logically
> just the same as if the loop was enqueue_task_fair, like you mentioned
> in a followup for the throttle_one_task and dequeue_task_fair?
>

Good point, that should be a mirror of throttle_one_task() wrt the enqueueing.

>>  void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>> @@ -5922,25 +6152,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>>                  goto unthrottle_throttle;
>>          }
>>
>> -        task_delta = cfs_rq->h_nr_running;
>> -        idle_task_delta = cfs_rq->idle_h_nr_running;
>> -        for_each_sched_entity(se) {
>> -                struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>> -
>> -                if (se->on_rq)
>> -                        break;
>> -                enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
>> -
>> -                if (cfs_rq_is_idle(group_cfs_rq(se)))
>> -                        idle_task_delta = cfs_rq->h_nr_running;
>> +        if (cfs_rq->throttle_count)
>> +                return;
>>
>> -                qcfs_rq->h_nr_running += task_delta;
>> -                qcfs_rq->idle_h_nr_running += idle_task_delta;
>> +        /*
>> +         * cfs_rq's below us may not have been fully emptied out, so we can't rely
>> +         * directly on ->h_nr_running. Use the stash instead.
>> +         */
>> +        task_delta = cfs_rq->unthrottled_h_nr_running;
>> +        idle_task_delta = cfs_rq->unthrottled_idle_h_nr_running;
>>
>> -        /* end evaluation on encountering a throttled cfs_rq */
>> -        if (cfs_rq_throttled(qcfs_rq))
>> -                goto unthrottle_throttle;
>> -        }
>> +        walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_clear_up, (void *)rq);
>>
>>          for_each_sched_entity(se) {
>>                  struct cfs_rq *qcfs_rq = cfs_rq_of(se);
>
> I think this would be simpler by making the first/original
> walk_tg_tree_from be
>
>   walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_clear_down, tg_unthrottle_up, (void *)rq);
>
> (With the clear function being the same, just renamed)
>
> We never have any unthrottled* saved between calls to unthrottle_cfs_rq,
> because if throttled_count is set for us, it's set for all of our
> descendants too, so we're doing nothing but throttle_count stuff in that
> case. Sadly that doesn't let us remove the cfs_rq fields, that would
> need a walk_tg_tree_by_level.

Good point, I think that makes sense. Thanks!
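To make sure I'm reading this right, the relevant part of unthrottle_cfs_rq()
would then look something like the below (untested excerpt, with
tg_unthrottle_clear_down() being the current clearing callback renamed and run
on the way down):

        /*
         * Single walk: clear the ->unthrottled_* stashes on the way down,
         * re-enqueue the limbo tasks and accumulate into the stashes on the
         * way up. No stale stash values can survive from a previous call:
         * if ->throttle_count was set for us, it was set for all of our
         * descendants too, so that call only touched throttle_count.
         */
        walk_tg_tree_from(cfs_rq->tg, tg_unthrottle_clear_down, tg_unthrottle_up, (void *)rq);

        /* ... !cfs_rq->load.weight handling unchanged ... */

        if (cfs_rq->throttle_count)
                return;

        /* The stash is still valid here; the second (clearing) walk is gone */
        task_delta = cfs_rq->unthrottled_h_nr_running;
        idle_task_delta = cfs_rq->unthrottled_idle_h_nr_running;

        /* for_each_sched_entity() up-propagation below stays as it is */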